# The Humanities in the Digital: Beyond Critical Digital Humanities

Lorella Viola
University of Luxembourg, Esch-sur-Alzette, Luxembourg

ISBN 978-3-031-16949-6    ISBN 978-3-031-16950-2 (eBook)
https://doi.org/10.1007/978-3-031-16950-2

© The Editor(s) (if applicable) and The Author(s) 2023. This book is an open access publication.

**Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cover illustration: © Alex Linch shutterstock.com

This Palgrave Macmillan imprint is published by the registered company Springer Nature Switzerland AG

The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

*Alla mia famiglia*

# FOREWORD

No single line of development leads to the intersection of digital methods and scholarship in the humanities. The mid-twentieth-century activity of Roberto Busa is frequently cited in origin stories of digital humanities. His collaboration with IBM was undertaken when Thomas Watson saw the potential for automating the Jesuit scholar's concordance of the works of St Thomas Aquinas.

But instruments essential to automation include intellectual and technological components with much longer histories. These connect to such mechanical precedents as the Pascaline, a calculating device named for its seventeenth-century inventor, the philosopher Blaise Pascal, and to the finely crafted Jacquard punch cards used to direct shuttle movements in the nineteenth-century textile industry. In addition, a wide array of systems of formal logic, procedural mathematics and statistical methods developed over centuries have provided essential foundations for computational operations.

Many features of contemporary networked scholarship can be tracked to earlier information systems of knowledge management and even ancient strategies of record-keeping. Scholarly practices were always mediated through technologies and infrastructures, whether these were hand-copied scrolls and codices, shelving systems and catalogues or other methods for search, retrieval and reproduction. The imprints of Babylonian grids, arithmetic visualisations, legacy metadata and classification schemes remain present across contemporary knowledge work. Now that the role of digital technology in scholarship has become conspicuous in any and every domain, the line between technological and humanistic domains is sometimes hard to discern in our daily habits. Even the least computationally savvy scholar regularly makes use of digital resources and activities. Of course, discrepancies arise across cultures and geographies, and certainly not all scholarship in every community exists in the same networked environment. Plenty of 'traditional' scholarship continues in direct contact with physical artefacts of every kind. But much of what has been considered 'the humanities' within the long traditions of Western and Asian culture now functions on foundations of digitised infrastructure for access and distribution. To a lesser degree, it also functions with the assistance of computational processing that assists not only search and retrieval, but also analysis and presentation in quantitative and graphical display.

As we know, familiar technologies tend to become invisible as they increase in efficiency. The functional device does not, generally, call attention to itself when it performs its tasks seamlessly. The values of technological optimisation privilege these qualities as virtues. Habitual consumers of streaming content are generally uninterested in a Brechtian experience of alienation meant to raise awareness of the conditions of viewing within a cultural-ideological matrix. Such interventions would be tedious and distracting and only lead to frustration, whether part of an entertainment experience or a scholarly and pedagogical one. Consciousness raising through aesthetic work has gone the way of the early twentieth-century avant-garde 'slap in the face of public taste' and other tactics to shock the bourgeoisie. The humanities must stream along with the rest of content delivered on demand and in consumable form, preferably with special effects and in small packets of readily digested material. Immersive Van Gogh and Frida Kahlo exhibits have joined the traditional experience of looking at painting. The expectations of gallery viewers are now hyped by theme park standards. The gamification of classrooms and learning environments caters to attention-distracted participants. Humanities research struggles in such a context, even as knowledge of history, ancient, indigenous and classical languages and expertise in scholarly methods such as bibliography and critical editing are increasingly rare.

Meanwhile, debates in digital humanities have become fractious in recent years, with 'critical' approaches pitting themselves against earlier practices characterised as overly positivistic and reductive. Pushback and counterarguments have split the field without resulting in any substantive innovation in computational methods, just a shift in rhetoric and claims. Algorithmic bias is easier to critique than change, even if advocates assert the need to do so. The question of whether statistical and formal methods imported from the social sciences can adequately serve humanistic scholarship remains open and contentious several decades into the use of now-established practices. Even exuberant early practitioners rarely promoted digital humanities as a unified or salvific approach to scholarly transformation, though they sometimes adopted computational techniques somewhat naively, using programs and platforms in a black box mode, unaware of their internal workings.

The instrumental aspects of digital infrastructure are only one topic of current dialogues. From the outset of my own encounter with digital humanities, sometime in the mid-1990s, what I found intriguing were the intellectual exigencies asserted as a requirement for working within the constraints of these formal systems. While all of this has become normative in the last decades, a quarter of a century ago, the recognition that much that had been implicit in humanities scholarship had to be made explicit within a digital context produced a certain frisson of excitement. The task of rethinking how we thought, of thinking in relation to different requirements, of learning to imagine algorithmic approaches to research, to conceptualise our explorations in terms of complex systems, emergent properties, probabilistic results and other frameworks infused our research with exciting possibilities.

As I have noted, much of that novelty has become normative—no longer called self-consciously to attention—just as awareness of the interface is lost in the familiar GUI screens. Still, new questions did get asked and answered through automated methods—mainly benefits at scale, the 'reading' of a corpus rather than a work, for instance. Innovative research crosses material sciences and the humanities, promotes non-invasive archaeology and supports authentication and attribution studies. Errors and mistakes abound, of course, and the misuse of platforms and processes is a regular feature of bad digital humanities—think of all those network diagrams whose structure is read as semantic even though it is produced to optimise screen legibility. Misreading and poor scholarship are hardly exclusive to digital projects even if the bases for claims to authority are structured differently in practices based on human versus machine interpretation.

Increased sophistication in automated processes (such as named entity recognition, part-of-speech parsing, visual feature analysis, etc.) continues to refine results. But the challenge that remains is to learn to think in and with the technological premises. A digital project is not just an automated version of a traditional project; it is a project conceived from the outset in terms that structure the problems and possible outcomes according to what automated and computational processes enable. Using statistical sampling methods is not a machine-supported version of serendipity or chance encounter—it is a structurally and intellectually different activity. Photoshop is not just a camera on steroids; it is a means of abstracting visual information into discrete components that can be manipulated in ways that were not conceptually or physically possible in wet darkrooms. Similarly, other common programs like Gephi, Cytoscape, Voyant and Tableau contain conceptual features un-thought and unthinkable in analogue environments—but they need to be engaged with an understanding of what those features allow.

Detractors scoff, sceptics cringe and the naysayers of various critical stripes protest that all of this aligns with various agendas—political, neoliberal, free market or whatever—as if intellectual life had ever been free of conditions and contexts. Where were those pure humanities scholars of a bygone era? Working for the Church? The State? Elite universities? Administrative units within the legal systems of national power structures? The science labs that hatched nefarious outcomes in the name of pure research? Finger pointing and head wagging will not change the reality that the humanities have always been integrated into civilisations and cultures to serve partisan agendas and hegemonic power structures. Poetics and aesthetics provide insight into the conditions from which they arise; they are not independent of them. We no longer subscribe to the tenets of Matthew Arnold's belief that the 'best that has been made and thought' in human expression contributes to moral uplift and improvement. Everything is complicit. Digital humanities is hardly the first, or likely to be the last, instrument of exclusivity or oppression—as well as liberation and social progress.

The labour of scholarship continues along with the pedagogy to sustain it. This activity imprints many values and judgements in the materials and methods on which it proceeds. Basic activities like classification model the way objects are found and identified. Crucial decisions about digitisation—size, scale, quality and source—affect what is presented for study. Terms of access and use create hierarchies among communities, some of whom have more resources than others. In short, at every point in the chain of interrelated social and material activities that create digital assets, implementation and intellectual implications are combined. The charge to address social ills and inequities freights projects in digital humanities with tasks of reparation and redress, asking that the field bear the weight of an entire agenda of social justice. The relation between ethics and application in digital work raises more institutional, resource and epistemological issues than technical ones. Fairness requires equal opportunity for skilled production and access, as well as a share in the interpretative discourse. No substitute exists for doing the work and having the educational and material resources to do it. The first step in transforming a field is the choice to acquire its competencies. Ignorant critique is as pernicious and ineffectual as unthinking practice.

Lorella Viola's argument about the current state of digital scholarship is meant to shift the frameworks for understanding these issues. Historical tensions are evident in the way her subtitle frames her work as 'Beyond Critical Digital Humanities'. Acknowledging debates that have often pitted first-wave digital humanists against later critics, she positions her own research as 'post-authentic' by contrast. This term signals dismissal of the last shred of belief that digital and computational techniques were value neutral or promoted objectivity (a position taken only by a fraction of practitioners). But it also distances her from the standard 'critique' of these methods—that they are tools of a neoliberal university environment promoting entrepreneurial approaches to scholarship at the expense of some other, not very clearly articulated, alternative (another very worn-out line of discussion).

Keen to move beyond all this, Viola advocates 'symbiotic' and 'mutualist' as concepts that eschew many old binarisms and disciplinary boundaries. While acknowledging the range of work on which she herself is building, and the historical development of positions and counterpositions, she seeks to integrate critical principles into digital methods and projects from within her own experience of their practice. Her work is grounded in knowledge of text analysis and computational linguistics as well as interface design and visualisation. While her summary of polarised positions forms the opening section of this book, and underpins much of what she offers as an alternative, her vision of the way forward is synthetic and affirmative. Throughout, she invokes a post-authentic framework that emphasises critical engagement with digital operations as mediations. She frequently reiterates the point that geo-coding or sentiment analysis works within the dominant power structures that privilege Anglo-centric approaches and English language materials. Such biases are in part due to the historical site in which the work arose. Certain environments have more resources than others. The issue now is to create opportunities for transformation.

The larger assertions of Viola's project are crucial: that artificial binarisms which pit traditional/analogue and computational/digital approaches against each other, critical methods against technical ones and so on are distractions from the core issues. How is humanities scholarship to proceed? What intellectual expertise is required to work with and read into and through the processes and conditions in which we conceptualise the research we do?

Digital objects and computational processes have specific qualities that are distinct from those of analogue ones, and learning to think in those modes is essential to conceptualising the questions these environments enable, while recognising their limitations. Thinking as an algorithm, not just 'using' technology, requires a shift of intellectual understanding. But knowledge of disciplinary domains and traditions remains a crucial part of the essential expertise of the scholar. Without subject area expertise, humanities research is pointless whether carried out with digital or traditional methods. As long as human beings are the main players—agents or participants—in humanities research, no substitute or surrogate for that expertise can arise. When that ceases to be the case, these questions and debates will no longer matter. No sane or ethical humanist would hasten the arrival of the moment when we cease to engage in human discourse. No matter how much agency and efficacy are imagined to emerge in the application of digital methods, or how deeply we may come to love our robo-pets and AI-bot-assistants, the humanities are still intimately and ultimately linked to human experience of being in the world. Finally, the challenge to infuse computational methods with humanistic values, such as the capacity to tolerate ambiguity and complexity, remains. What, after all, is a humanistic algorithm, a bias-sensitive digital format or a self-conscious interface? What interventions in the technology would result in these transformations?

Lorella Viola has much to say on these matters and works from experience that combines hands-on engagement with computational methods and a critical framework that advances insight and understanding. So, machine and human readers, turn your attention to her text.

Los Angeles, CA, USA
May 2022

Johanna Drucker

# PREFACE

In 2016, the Los Angeles Review of Books (LARB) conducted a series of interviews with both scholars and critics of digital humanities titled 'The Digital in the Humanities'. The aim of the special interview series was 'to explore the intersection of the digital and the humanities' (Dinsman 2016) and the impact of that intersection on teaching and research. As I was reading through the various interviews collected in the series, there was something that recurrently caught my attention. Despite an extensive use of terminology that attempted to communicate ideas of unity—for example, digital humanities is described as a field that '*melds* computer science with hermeneutics'—it gradually became obvious to me how the traditional rigid notions of separation and dualism that characterise our contemporary model of knowledge creation were creeping in, surreptitiously but consistently. The following excerpt from the editorial to the special issue provides a good example (emphasis mine):

"digital humanities" seems astoundingly inappropriate for an area of study that includes, *on one hand*, computational research, digital reading and writing platforms, digital pedagogy, open-access publishing, augmented texts, and literary databases, and *on the other*, media archaeology and theories of networks, gaming, and wares both hard and soft. (ibid.)

Language is see-through. It is the functional description of our mental models; that is, it expresses our conceptual understanding of the world. In the example above, the use of the construction 'on one hand…on the other' conveys an image of two distinct, contrasting, polarised entities which are antithetical in their essence and which, despite intersecting—as is said a few lines below—remain fundamentally separate. This description of digital humanities reflects a specific mental model: that knowledge is made up of competences delimited by established disciplinary boundaries. Should there be overlapping spaces, boundaries neither dissolve nor merge; rather, disciplines further specialise and create yet new fields.

When the COVID-19 pandemic was at its peak, I was spending my days in my loft in Luxembourg and most of my activities were digital. I was working online, keeping contact with my family and friends online, watching the news online and taking online courses and online fitness classes. I even took part in an online choir project and in an online pub quiz. Of course, for my friends, family and acquaintances, it was not much different. As the days became weeks and the weeks became months and then years, it was quite obvious that it was no longer a matter of having the digital in our lives; rather, now everyone was *in* the digital.

This book is titled 'The Humanities in the Digital' as an intentional reference to the LARB interview series. With this title, I wanted to mark how the digital is now integral to society and its functioning, including how society produces knowledge and culture. The change in word order is meant to signal conclusively the obsolescence of binary modulations in relation to the digital which continue to suggest a division, for example, between digital knowledge production and non-digital knowledge production. Not only that. It is the argument of this book that dual notions of this kind are the spectre of a much deeper fracture, that which divides knowledge into disciplines and disciplines into two areas: the sciences and the humanities. This rigid conceptualisation of division and competition, I maintain, is complicit in having promoted a narrative which has paired computational methods with exactness and neutrality, rigour and authoritativeness whilst stigmatising consciousness and criticality as carriers of biases, unreliability and inequality.

This book argues against a compartmentalisation of knowledge and maintains that division in disciplines is not only unhelpful and conceptually limiting, but especially after the exponential digital acceleration brought about by the 2020 COVID-19 pandemic, also incompatible with the current reality. In the pages that follow, I analyse many of the different ways in which reality has been transformed by technology—the pervasive adoption of big data, the fetishisation of algorithms and automation and the digitisation of education and research—and I argue that the full digitisation of society, already well on its way before the COVID-19 pandemic but certainly brought to its non-reversible turning point by the 2020 health crisis, has added levels of complexity to reality that our model of knowledge, based as it is on single-discipline perspectives, can no longer explain. With this book, my intention is to open the necessary conversation that the historical moment demands.

The book is therefore primarily a reflection on the separation of knowledge into disciplines and of disciplines into the sciences vs the humanities, and discusses its contemporary relevance and adequacy in relation to the ubiquitous impact of digital technologies on society and culture. In arguing in favour of a reconfigured model of knowledge creation in the digital, I propose different notions, practices and values theorised in a novel conceptual and methodological framework, the post-authentic framework. This framework offers a more complex and articulated conceptualisation of digital objects than the one found in dominant narratives, which reduce them to mere collections of data points. Digital objects are understood as living compositions of humans, entities and processes interconnected according to various modulations of power embedded in computational processes, actors and societies. Countless versions can be created through such processes, which are shaped by past actions and in turn shape the following ones; thus digital objects are never finished, nor can they be finished, ultimately transcending traditional questions of authenticity.

Digital objects act in and react to society and therefore bear consequences; the post-authentic framework rethinks both products and processes, which are acknowledged as never neutral, incorporating external, situated systems of interpretation and management. Taking the humanities as a focal point, I analyse personal use cases in a variety of applied contexts such as digital heritage practices, digital linguistic injustice, critical digital literacy and critical digital visualisation. I examine how I addressed in my own work issues in digital practice such as transparency, documentation and reproducibility; questions about reliability, authenticity, biases, ambiguity and uncertainty; and ways of engaging with sources through technology. I discuss these case examples in the context of the post-authentic framework. By recognising the larger cultural relevance of digital objects and the methods used to create, analyse and visualise them, throughout the chapters of the book I show how the post-authentic framework affords an architecture for issues such as transparency, replicability, Open Access, sustainability, data manipulation, accountability and visual display.

*The Humanities in the Digital* ultimately aims to address the increasingly pressing questions: how do we create knowledge today? And how do we want the next generation of students to be trained? Beyond the rigid model of knowledge creation still fundamentally based on notions of separation and competition, the book shows another way: knowledge creation in the digital.

Esch-sur-Alzette, Luxembourg
July 2022

Lorella Viola

# ACKNOWLEDGEMENTS

*The Humanities in the Digital* is published Open Access thanks to support from three funding schemes: the *Fonds National de la Recherche Luxembourg* (FNR)—RESCOM: Scientific Monographs scheme, the Luxembourg Centre for Contemporary and Digital History (C2DH)—Digital Research Axis Fund and the C2DH Open Access Fund. My sincerest thanks for funding this book project. Your support accelerates discovery and creates fairer access to knowledge that is open to all.

The research carried out for this book was supported by FNR. Parts of the use cases illustrated in the book, including the conceptual work done towards developing the interface for topic modelling illustrated in Chap. 5, stem from research carried out within the project DIGITAL HISTORY ADVANCED RESEARCH PROJECTS ACCELERATOR (DHARPA). The discussions about network analysis and sentiment analysis and the interface examples of DeXTER described in Chap. 5 are the result of research carried out within the THINKERING GRANT which was awarded to me by the C2DH.

I would like to express my deepest gratitude to Sean Takats for his support and continuous encouragement during these years; without it, writing this book would have been much harder. A big thank you to the DHARPA project as a whole, which has been instrumental in elaborating the framework proposed in this book. I have been really lucky to be part of it.

I warmly thank Mariella de Crouy Chanel; my exchanges with you helped me identify and solve challenges in my work. Outside of official meetings, I have also much enjoyed our chats during lunch and coffee breaks. A very great thank you to Sean Takats, Machteld Venken, Andreas Musolff, Angela Cunningham, Joris van Eijnatten and Andreas Fickers for taking the time to comment on earlier versions of this book; thanks for being my critical readers. And thank you to the C2DH; in the Centre, I found fantastic colleagues and impressive expertise and resources. I could not have asked for more.

I am deeply grateful to Jaap Verheul and Joris van Eijnatten for providing me with invaluable intellectual support and guidance when I was taking my first steps in digital humanities. You have both always shown respect for my ideas and for me as a professional and a learner and I will always cherish the memory of my time at Utrecht University.

My sincere thanks to the Transatlantic research project *Oceanic Exchanges* (OcEx) of which I had the privilege to be part. In those years, I had the opportunity to work with leading historians and computer scientists; without OcEx, I would not be the scholar that I am today.

Very special thanks to my family, to which this book is dedicated, for always listening, supporting me and encouraging me to aim high and never give up. Your love and care are my light through life; I love you with all my heart. And a very big thank you to my little nephew Emanuele, who thinks I must be very cold away from Sicily. Your drawings of us in the Sicilian sun have indeed warmed many cold, Luxembourgish winter days.

I heartily thank Johanna Drucker for writing the Foreword to this book and more importantly for producing inspiring and important science.

Thanks also to those who have supported this book project right from the start, including the anonymous reviewers who have provided insights and valuable comments, contributing considerably to improving my work.

And finally, thanks to all those who, directly or indirectly, have been involved in this venture; in sometimes unpredictable ways, you have all contributed to making this book possible.

# ABOUT THE AUTHOR

**Lorella Viola** is a research associate at the Luxembourg Centre for Contemporary and Digital History (C2DH), University of Luxembourg. She is co-Principal Investigator in the Luxembourg National Research Fund project DHARPA (Digital History Advanced Research Project Accelerator). She was previously a research associate at Utrecht University, where she was Work Package Leader in the Transatlantic digital humanities research project Oceanic Exchanges. Currently, she is co-editing the volume *Multilingual Digital Humanities* (Routledge), which brings together, advances and reflects on recent work on the social and cultural relevance of multilingualism for digital knowledge production. Her scholarship has appeared, among others, in *Digital Scholarship in the Humanities*, *Frontiers in Artificial Intelligence*, *International Journal of Humanities and Arts Computing* and *Reviews in Digital Humanities*.

CHAPTER 1

# The Humanities *in* the Digital

The ultimate, hidden truth of the world is that it is something that we make, and could just as easily make differently. (Graeber 2013)

## 1.1 IN THE DIGITAL

The digital transformation of society was saluted as the imperative, unstoppable revolution which would provide unparalleled opportunities to our increasingly globalised societies. Among other benefits, it was praised for being able to accelerate innovation and economic growth, increase flexibility and productivity, reduce waste, simplify and facilitate services and information provision and improve competitiveness by drastically reducing development time and cost (Komarčević et al. 2017). At the same time, however, warnings about the dramatic and disruptive changes and outcomes that it would inevitably carry accompanied the considerable hype. For example, several economists raised serious concerns about the major risks that would derive from the digital transformation of society. A non-negligible number of evidence-based studies projected a rise in social inequality, job loss and job insecurity, wage deflation, increased polarisation in society, issues of environmental sustainability, local and global threats to security and privacy, decrease in trust, ethical questions on the use of data by organisations and governments and online profiling, outdated regulations, issues of accountability in relation to algorithmic governance, erosion of social security and intensification of isolation, anxiety, stress and exhaustion (e.g., Autor et al. 2003; Cook and Van Horn 2011; Hannak et al. 2014; Lacy and Rutqvist 2015; Weinelt 2016; Frey and Osborne 2017; Komarčević et al. 2017; Schwab 2017).

Despite all the evidence, however, the extraordinary collective advantages presented by the new technologies were believed to far outweigh the risks (Weinelt 2016; Komarčević et al. 2017; Schwab 2017). Indeed, the prevailing tendency was to describe these great dangers rather as 'challenges' which, however significant, were believed to be within governments' reach. The digital transformation of society would undoubtedly provide unprecedented 'opportunities' to collaborate across geographies, sectors and disciplines, so naturally, on the whole, the highly praised positives of the digital revolution overshadowed the negatives. Some experts comment that this is in fact hardly surprising, as in order for a revolution to be accomplished, the necessary support must be mobilised by governments, universities, research institutions, citizens and businesses (Komarčević et al. 2017).

Thus, in the last decade, though with differences across countries, both the public and the private sector have embraced the digital transformation (European Center for Digital Competitiveness 2021). Governments around the world have increasingly implemented comprehensive technology-driven programmes and legal frameworks aimed at boosting innovation and entrepreneurship, whilst the industrial sector as a whole has invested massively in digitising business processes, work organisation and culture, modalities of market access, models of management and relationships with customers (ibid.). The digital transformation has then over the years forced businesses and governments to revolutionise their infrastructures to incorporate an effective and comprehensive digital strategy. Indeed, as always in history, the choice between adopting the new technology or not has quickly become one between innovation and extinction.

The digital transformation has profoundly affected research as well. The incorporation of technology in scholarship practice and culture, the implementation of data-driven approaches and the size and complexity of usable and used data have increased exponentially in natural, computational, social science and humanities research. The 'Digital Turn', as it is called, has almost forced scholars to integrate advanced quantitative methods in their research, and in the humanities at large, it has, for example, led to the emergence of completely new fields such as, of course, digital humanities (DH) (Viola and Verheul 2020b).

Institutionally, universities have in contrast been slow to adapt. Although bringing the digital to education and research has been on higher education institutions' agendas for years, the changes have always been set to be implemented gradually over the span of several years. Universities have, in other words, adopted an *evolutionary approach* to the digital (Alenezi 2021), according to which digital benefits are incorporated within an existing model of knowledge creation. This means that, on the one hand, the integration of the digital into knowledge creation practices and the combination of methods and perspectives from different disciplines are highly encouraged and much praised as the most effective way to accelerate and expand knowledge. At the same time, however, technology and the digital are seen as entities somewhat separate, or indeed separable, from knowledge creation itself. This moderate approach allows a gradual pace of change, and it is generally praised for its capacity to minimise disruptions while at the same time allowing change (Komarčević et al. 2017; Microsoft Partner Community 2018).

The reasons why universities have traditionally chosen this strategy are various and complex, but generally speaking they all have something in common. In his book *Learning Reimagined*, Graham Brown-Martin (2014) argues that the current model of education is still the same as the one that was set up to prepare the industrial workforce of the nineteenth-century factories. This model was designed to create workers who would do their job silently all day to produce identical products; collaboration, creativity and critical thinking were precisely what the model aimed to discourage. As this system has become less and less relevant over the years, it has become increasingly costly to replace the existing infrastructures, including radically rethinking teaching and learning practices and devising a new model of knowledge creation that would suit higher education's mission while at the same time responding to the needs of the new digital information and knowledge landscape. Therefore, for higher education institutions, the preferred strategy has traditionally been to progressively integrate digital tools in their existing systems, as a means to advance educational practices whilst containing the exorbitant costs that a true revolution would entail, including the inevitable disruptive changes. After all, despite what the word 'revolution' may suggest, these complex and radical processes are painfully slow and always require years to be implemented. In fact, as the 'Gartner Hype Cycle' of technology indicates (Fenn and Raskino 2008), only some of these processes are actually expected to eventually reach the virtual status of 'Plateau of Productivity', and if there is a cost to adapting slowly, the cost of being wrong is higher.

The 2020 health crisis changed all this. In just a few months' time, the COVID-19 pandemic accelerated years of change in the functioning of society, including the way companies in all sectors operated. In 2020, the McKinsey Global Institute surveyed 800 executives from a wide variety of sectors based in the United States, Australia, Canada, China, France, Germany, India, Spain and the United Kingdom (Sua et al. 2020). The report showed that since the start of the pandemic, companies had accelerated the digitisation of both their internal and external operations by three to four years, while the share of digital or digitally enabled products in their portfolios had advanced by seven years. Crucially, the study also provided insights into the long-term effects of such changes: companies claimed that they were now investing in their long-term digital transformations more than in anything else. According to a BDO report on the digital transformation brought about by the COVID crisis (Cohron et al. 2020, 2), just as businesses that had developed and implemented digital strategies prior to the pandemic were in a position to leapfrog their less digital competitors, organisations that would not adapt their digital capabilities for the post-coronavirus future would simply be surpassed.

Higher education has also been deeply affected. Before the COVID-19 crisis, higher education institutions would look at technology's strategic importance not as a critical component of their success but more as one piece of the pedagogical puzzle, useful both to achieve greater access and as a source of cost efficiency. For example, many academics had never designed or delivered a course online, carried out online student supervisions, served as online examiners or presented or attended an online conference, let alone organised one. According to the United Nations Educational, Scientific and Cultural Organization (UNESCO), at the first peak of the crisis in April 2020, more than 1.6 billion students around the world were affected by campus closures (UNESCO 2020). As on-campus learning was no longer possible, demand for online courses saw an unprecedented rise. Coursera, for example, experienced a 543% increase in new course enrolments between mid-March and mid-May 2020 alone (DeVaney et al. 2020). Having to adapt quickly to the virtual switch—much more quickly than they had considered feasible before the outbreak—universities and higher education institutions were forced to implement some kind of temporary digital solutions to meet the demands of students, academics, researchers and support staff. At the peak of the pandemic, classes needed to be moved online practically overnight, and so did all sorts of academic interactions that would typically occur face-to-face: supervisions, meetings, seminars, workshops and conferences, to name but a few. Universities and research institutes did not have much choice other than to respond rapidly. Thus, just like in the business sector, the shift towards digital channels had to happen fast, as those institutions that did not promptly and successfully achieve the transition towards the digital were at high risk of reducing their competitiveness dramatically, and not just in the near term.

The sudden accelerated digital shift by universities is one aspect of society's forced digital switch during 2020. Remote work, omnichannel commerce, digital content consumption, platformification and digital health solutions are also examples of how society was kept afloat by the migration to the digital during the pandemic. This is not the kind of process that can be fully reversed. On the contrary, the most significant changes such as remote working, online offerings and remote interactions are in fact the most likely to remain in the long term, at least in some hybrid form. According to the McKinsey Global Institute survey (op. cit.), because such changes reflect new health and hygiene sensitivities, respondents were more than twice as likely to believe that there would not be a full return to pre-crisis norms at all. Similarly, higher education predictions concerning digital or digitally enhanced offerings anticipated that these were likely to stay even after the health crisis was resolved. Dynamic and blended approaches are therefore likely to become the 'new normal' as they allow universities to minimise potential teaching and learning disruptions in case of emergency and, more importantly, they can now be implemented at a moment's notice. Consequently, instructors are more and more required to reimagine their courses for an online format. The same goes for all the other aspects of a scholar's life such as conference presentations, seminars, workshops, supervisions and exams, as well as research-specific tasks, including data gathering and analysis.

COVID-19 has finally also changed the role of technology, particularly with regard to its crucial function in universities' risk mitigation strategies. According to the 2020 Coursera guide for universities on building and scaling online learning programmes, universities that are investing heavily in their digital infrastructures today will be able to seamlessly pivot through any crisis in the future (DeVaney et al. 2020, 1). Although the digitisation of society was already underway before the crisis, it is argued in these reports that the COVID-19 pandemic has marked a clear turning point of historic proportions for technology adoption, one in which the paradigm shift towards digitisation has been sharply accelerated.

Yet if during the health crisis companies and universities were forced to adopt similar digitisation strategies, almost three years after the start of the pandemic, things between the two sectors now look different again. To succeed and adapt to the demands of the new digital market, companies understood that in addition to investing massively in their digital infrastructures, they crucially also had to create new business models that replaced the existing ones, which had simply become inadequate to respond to the rules dictated by new generations of customers and technologies. The digital transformation has therefore required a deeper transformation in the way businesses structured their organisations, thought about market challenges and approached problem-solving (Morze and Strutynska 2021). In contrast, it appears that higher education has returned to looking at technology as a means for incremental changes, once again as a way to enhance learning approaches or for cost reduction purposes, but its disruptive and truly revolutionary impact continues to be poorly understood and on the whole under-theorised (Branch et al. 2020; Alenezi 2021). For instance, although universities and research institutes have to various degrees digitised pedagogical approaches, added digital skills to their curricula and favoured the use and development of digital methods and tools for research and teaching, technology is still treated as something contextual, something that happens alongside knowledge creation.

Knowledge creation, however, happens *in* society. And while society has been radically transformed by technology, which has in turn transformed culture and the way society creates it, universities continue to adopt an evolutionary approach to the digital (Alenezi 2021): more or less gradual adjustments are made to incorporate it, but the existing model of knowledge creation is left essentially intact. The argument that I advance in this book is, on the contrary, that digitisation has involved a much greater change, a more fundamental shift for knowledge creation than the current model of knowledge production accommodates. This shift, I claim, has in fact been *in*—as opposed to *towards*—the digital. As societies are in the digital, one profound consequence of this shift is that research and knowledge are also in turn inevitably mediated by the digital to various degrees. As a bare minimum, for example, regardless of the discipline, a post-COVID researcher is someone able to embrace a broad set of digital tools effectively. Yet what this entails in terms of how knowledge production is now accordingly lived, reimagined, conceptualised, managed and shared has not yet been adequately explored, let alone formally addressed. In relation to knowledge creation, what I therefore argue for is a *revolutionary* rather than an *evolutionary* approach to the digital. Whereas an evolutionary approach to the digital extends the existing model of knowledge creation to incorporate the digital in some form of supporting role, a *revolutionary* approach calls for a different model which entirely reconceptualises the digital and how it affects the very practices of knowledge production. Indeed, claiming that the shift has been *in* the digital acknowledges conclusively that the digital is now integral not only to society and its functioning, but crucially also to how society produces knowledge and culture.

Crucially, such a different model of knowledge production must break with the obsolescence of persisting binary modulations in relation to the digital—for example, between digital knowledge creation and non-digital knowledge creation—in that they continue to suggest artificial divisions. It is the argument of this book that dual notions of this kind are the spectre of a much deeper fracture, that which divides knowledge into disciplines and disciplines into two areas: the sciences and the humanities. Significantly, a consequence of the shift in the digital is that reality has been complexified rather than simplified. Many of the multiple levels of complexity that the digital brings to reality are so convoluted and unpredictable that the traditional model of knowledge creation based on single-discipline perspectives and divisions is not only unhelpful and conceptually limiting, but especially after the exponential digital acceleration brought about by the 2020 COVID-19 pandemic, also incompatible with the current reality and no longer suited to understanding and explaining the ramifications of this unpredictability.

In arguing against a compartmentalisation of knowledge which essentially disconnects rather than connects expertise (Stehr and Weingart 2000), I maintain that the insistent rigid conceptualisation of division and competition is complicit in having promoted a narrative which has paired computational methods with exactness and neutrality, rigour and authoritativeness whilst stigmatising consciousness and criticality as carriers of biases, unreliability and inequality. The book is therefore primarily a reflection on the separation of knowledge into disciplines and of disciplines into the sciences vs the humanities, and discusses its contemporary relevance and adequacy in relation to the ubiquitous impact of digital technologies on society and culture. In the pages that follow, I analyse many of the different ways in which reality has been transformed by technology—the pervasive adoption of big data, the fetishisation of algorithms and automation, the digitisation of education and research and the illusory, yet believed, promise of objectivism—and I argue that the full digitisation of society, already well on its way before the COVID-19 pandemic but certainly brought to its non-reversible turning point by the 2020 health crisis, has added even further complexity to reality, exacerbating existing fractures and disparities and posing new complex questions that urgently require a re-theorisation of the current model of knowledge creation in order to be tackled.

In advocating for a new model of knowledge production, the book firmly opposes notions of divisions, particularly a division of knowledge into monolithic disciplines. I contend that the recent events have brought into sharper focus how understanding knowledge in terms of discipline compartmentalisation is anachronistic and not equipped to encapsulate and explain society. The pandemic has ultimately called for a reconceptualisation of knowledge creation and practices which now must operate beyond outdated models of separation. In moving beyond the current rigid framework within which knowledge production still operates, I introduce different concepts and definitions in reference to the digital, digital objects and practices of knowledge production in the digital, which break with dialectical principles of dualism and antagonism, including dichotomous notions of digital vs non-digital positions.

This book focuses on the humanities, the area of academic knowledge that has already undergone radical transformation by the digital in the last two decades. I start by retracing schisms in the field between the humanities, the digital humanities (DH) and critical digital humanities (CDH); these are embedded, I argue, within the old dichotomy of sciences vs humanities and the persistent physics envy in our society and, by extension, in research and academic knowledge. I especially challenge existing notions such as that of 'mainstream humanities', which characterise it as a field that is seemingly non-digital but critical. I maintain that in the current landscape, conceptualisations of this kind read more like a nostalgic invocation of a reality that no longer exists, perhaps as an attempt to reconstruct the core identity of a pre-digital scholar who now more than ever feels directly threatened by an aggressive *other*: the digital. Equally irrelevant and unhelpful, I argue, is a further division of the humanities into DH and CDH. In pursuing this argumentation, I examine how, on the one hand, scholars arguing in favour of CDH claim that the distinction between digital and analogue is pointless and that humanists must therefore embrace the digital critically; on the other hand, by creating a new field, i.e., CDH, they fall into the trap of effectively perpetuating the very separation between digital and critical that they define as no longer relevant.

In pursuing my case for a novel model of knowledge creation in the digital, throughout the book I analyse personal use cases; specifically, I examine how I have addressed in my own work issues in digital practice such as transparency, documentation and reproducibility, questions about reliability, authenticity and biases, and engaging with sources through technology. Across the various examples presented in the following chapters, this book demonstrates that a re-examination of digital knowledge creation can no longer be achieved from a distance, but only from the inside: the digital is no longer contextual to knowledge creation; knowledge is created in the digital. This auto-ethnographic and self-reflexive approach allows me to show how my practice as a humanist *in* the digital has evolved over time and through the development of different digital projects. My intention is not to simply confront algorithms as instruments of automation but to unpack 'the cultural forms emerging in their shadows' (Gillespie 2014, 168). Expanding on critical posthumanities theories (Braidotti 2017; Braidotti and Fuller 2019), to this aim I develop a new framework for digital knowledge creation practices—the post-authentic framework (*cf.* Chap. 2)—which critiques current positivistic and deterministic views and offers new concepts and methods to be applied to digital objects and to knowledge creation *in* the digital.

A little less than a decade ago, Berry and Dieter (2015) claimed that we were rapidly entering a world in which it was increasingly difficult to find culture outside digital media. The major premise of this book is that especially after COVID-19, all information is now digital and, even more, algorithms have become central nodes of knowledge and culture production with an increased capacity to shape society at large. I therefore maintain that universities and higher education institutions can no longer afford to consider the digital as something that is happening *to* knowledge creation. It is time to recognise that knowledge creation is happening *in* the digital. As digital vs non-digital positions have entirely lost relevance, we must recognise that the current model of knowledge grounded in rigid divisions is at best irrelevant and unhelpful and at worst artificial and harmful. Scholars, researchers, universities and institutions therefore have a central role to play in assessing how digital knowledge is created, not just today but also for the purpose of future generations, and clear responsibilities to shoulder: those that come from being *in* the digital.

## 1.2 THE ALGORITHM MADE ME DO IT!

Computational technology such as artificial intelligence (AI) can in many ways be thought of as a 'Mechanical Turk'. The Mechanical Turk, or simply 'The Turk', was a chess-playing machine constructed by Wolfgang von Kempelen in the late eighteenth century. The mechanism appeared to be able to play a game of chess against a human opponent completely by itself. The Turk was brought to various exhibitions and demonstrations around Europe and the Americas for over eighty years and won most of the games played, defeating opponents such as Napoleon Bonaparte and Benjamin Franklin. In reality, the Mechanical Turk was a complex mechanical illusion that was in fact operated by a human chess master hiding inside the machine.

AI and technology can be thought of in many ways as being like the Mechanical Turk, whereby the choices and actions hidden from view create the illusion of both a fully autonomous process and an impartial output. And just as the Mechanical Turk was celebrated and paraded, the 'Digital Turn' and its flow of data have been applauded and welcomed practically ubiquitously. Indeed, hyped up by the reassuring promises of neutrality, objectivity, fairness and accuracy held out by digital technology and data, both industry and academia have embraced the so-called big data revolution: data-sets that are so large and complex that no traditional software—let alone humans—would ever be able to analyse them. In 2017, IBM reported that more than 90% of the world's data had been created in the two previous years alone. Today, in sectors such as healthcare, big data is being used to reduce healthcare costs for individuals, to improve the accuracy of diagnoses and reduce waiting times, to effectively avoid preventable diseases or to predict epidemic outbreaks. The market of big data analytics in healthcare has grown continually, and not just since the COVID-19 pandemic. According to a 2020 report about big data in healthcare, the global big data healthcare analytics market was worth over \$14.7 billion in 2018 and \$22.6 billion in 2019 and was expected to be worth \$67.82 billion by 2025. A more recent projection in June 2020 estimated this growth to reach \$80.21 billion by 2026, exhibiting a CAGR of 27.5% (ResearchAndMarkets.com 2020).
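To give a sense of what such a figure means: CAGR, the compound annual growth rate cited in projections of this kind, is conventionally defined as the constant yearly rate at which a value would need to grow to move from an initial value $V_{\text{start}}$ to a final value $V_{\text{end}}$ over $n$ years (the formula below is the standard definition, not one given in the report):

$$\mathrm{CAGR} = \left(\frac{V_{\text{end}}}{V_{\text{start}}}\right)^{1/n} - 1$$

A CAGR of 27.5% therefore implies that the market is projected to more than triple roughly every five years, since $1.275^{5} \approx 3.4$.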

Big data analytics has also been incorporated into the banking sector for tasks such as improving the accuracy of the risk models used by banks and financial institutions. In credit management, banks use big data to detect fraud signals or to understand customer behaviour from the analysis of investment patterns, shopping trends, motivation to invest and personal or financial background. According to recent predictions, the market of big data analytics in banking could rise to \$62.10 billion by 2025 (Flynn 2020). Ever larger and more complex data-sets are also used for law and order policy (e.g., predictive policing), for mapping user behaviour (e.g., social media), for recording speech (e.g., Alexa, Google Assistant, Siri) and for collecting and measuring individuals' physiological data, such as their heart rate, sleep patterns, blood pressure or skin conductance. And these are just a few examples.

More data and *therefore* more accuracy and freedom from subjectivity were also promised to research. Disciplines across scientific domains have increasingly incorporated technology within their traditional workflows and developed advanced data-driven approaches to analyse ever larger and more complex data-sets. In the spirit of breaking the old schemes of opaque practices, it is the humanities, however, that has arguably been impacted the most by this explosion of data. Thanks to the endless flow of searchable material provided by the Digital Turn, humanists could now finally change the fully hermeneutical tradition, believed to perpetuate discrimination and biases.

This looked like 'that noble dream' (Novick 1988). Millions of source records seemed to be just a click away. Any humanist scholar with a laptop and an Internet connection could potentially access, explore and analyse them. Even more revolutionary was the possibility of finally drawing conclusions from *objective evidence* and so dismissing all accusations that the humanities was a field of obscure, non-replicable methods. Through large quantities of 'data', humanists could now understand the past more wholly, draw more rigorous comparisons with the present and even predict the future. This 'DH moment', as it was called, was perfectly in line with a more global trend according to which data was (and to a large extent still is) presumed to be accurate and unbiased, therefore more reliable and, ultimately, fairer (Christin 2016). The 'DH promise' (Thomas 2014; Moretti 2016) was a promise of freedom: freedom from subjectivity, from unreliability, but more importantly from the supposed irrelevance of the humanities in a data-driven world. It was also soaked in positivistic hype about the endless opportunities of data-driven research methods in general and for humanities research in particular, such as the artful deception of suddenly being able to access *everything*, or the scientistic belief in data as more reliable than sources.

Following this positivistic hype, however, the unquestioning belief in the endless possibilities and benefits of applying computational techniques for the good of society and research started to be harshly criticised as false and unrealistic (*cfr.* Sect. 1.3). The alluring and reassuring promises of data neutrality, objectivity, fairness and accuracy have indeed been found illusory; algorithms and data-driven methods have proven even more biased than the interpretative act itself (Dobson 2019) and, ironically, in desperate need of human judgement if they are not to cause harm (Gillespie 2014).

The indiscriminate use of big data in domains of societal influence such as bureaucracy, policy-making or policing, in particular, has started to raise fundamental questions about democracy, ethics and accountability. For example, data companies hired by politicians all over the world have used questionable methods to mine the social media profiles of voters and influence election results through microtargeting, a technique that uses extremely targeted messages to sway users' behaviour. Although this technique has proven highly effective for marketing purposes, the causal effects of political microtargeting remain largely under-researched and therefore poorly understood. The fact remains, however, that using personal data collected without users' knowledge or permission to build sophisticated profiling models raises ethical and privacy issues. In 2015, for example, Cambridge Analytica acquired the personal data of about 87 million Facebook users without their explicit permission. The data had been collected via the 270,000 Facebook users who had given the third-party app 'This Is Your Digital Life' access to information about their friends' networks. Cambridge Analytica claimed the data was exclusively for academic purposes, and Facebook allowed the app to harvest data from the Facebook friends of the app's users, which Cambridge Analytica subsequently exploited. In this way, although only 270,000 people had given the app permission, data was in fact collected from 87 million users. This revealed an alarming loophole in Facebook's privacy agreement concerning the management of personal data; it raised serious concerns about how digital private information is collected, stored and shared, not just by Facebook but by companies in general, and about how these opaque processes often leave unwitting individuals completely powerless.

But it is not just tech giants and academic research that jumped on the suspect big data and AI bandwagon; governments around the world have also been exploiting this technology for matters of governance, law enforcement and surveillance, such as blacklisting and so-called predictive policing, a data-driven analytics method used by law enforcement departments to predict perpetrators, victims or locations of future crimes. Predictive policing software analyses large sets of historic and current crime data using machine learning (ML) algorithms to determine where and when to deploy police (i.e., place-based predictive policing) or to identify individuals who are allegedly more likely to commit or be a victim of a crime (i.e., person-based predictive policing). While supporters of predictive policing argue that these systems predict future crimes more accurately and objectively than traditional policing methods, critics point to the lack of transparency in how these systems actually work and are used, and warn about the dangers of blindly trusting the supposed rigour of this technology. In June 2020, for example, Santa Cruz, California—one of the first US cities to pilot the technology, in 2011—became the first city in the United States to ban its municipal use, discontinuing the programme after nine years over concerns about how it perpetuated racial inequality. The argument is that, as the data-sets used by these systems include only reported crimes, the resulting predictions are deeply flawed and biased, producing what could be seen as a self-fulfilling prophecy. In this respect, Matthew Guariglia maintains that 'predictive policing is tailor-made to further victimize communities that are already overpoliced—namely, communities of colour, unhoused individuals, and immigrants—by using the cloak of scientific legitimacy and the supposed unbiased nature of data' (Guariglia 2020). Despite other examples of predictive policing programmes being discontinued following audits and lawsuits, at the moment of writing more than 150 cities in the United States have adopted predictive policing (Electronic Frontier Foundation 2021). Outside the United States, China, Denmark, Germany, India, the Netherlands and the United Kingdom are also reported to have tested or deployed predictive policing tools.
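
To make this feedback dynamic concrete, consider a minimal simulation, with invented numbers rather than any real system's data, of the place-based allocation loop described above: patrols follow past reports, and crime is only recorded where patrols are looking.

```python
# A minimal simulation (invented numbers) of the feedback loop critics
# describe: patrols are sent where past *reported* crime is highest,
# and crime is only recorded where patrols are looking.
import random

random.seed(0)
true_crime_rate = [0.5, 0.5]  # two districts with identical underlying crime
reports = [8, 2]              # district 0 happens to start out more policed

for day in range(1000):
    # place-based prediction: deploy today's patrol to the district
    # with the most historical reports
    target = reports.index(max(reports))
    # an incident enters the data-set only if a patrol is there to see it
    if random.random() < true_crime_rate[target]:
        reports[target] += 1

# Despite equal true crime, virtually all new reports accrue to
# district 0: the prediction fulfils itself.
print(reports)
```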

The problem with predictive policing has little to do with intentionality and a lot to do with the limits of computation. A computer algorithm is a finite list of instructions designed to perform a computational task and produce a result, i.e., an output of some kind. Each task is therefore performed on the basis of a series of instructed assumptions which, far from being unbiased, are not only obfuscated by the complexity of the algorithm itself but also artfully hidden by the surrounding algorithmic discourse, which socially legitimises its outputs as objective and reliable. The truth, however, is that computers are extremely efficient and fast at automating complex and lengthy processes but perform rather poorly when it comes to decision-making and judgement. In the words of Danah Boyd (2016, 231):

[…] if they [computers] are fed a pile of data and asked to identify correlations in that data, they will return an answer dependent solely on the data they know and the mathematical definition of correlation that they are given. Computers do not know if the data they receive is wrong, biased, incomplete, or misleading. They do not know if the algorithm they are told to use has flaws. They simply produce the output they are designed to produce based on the inputs they are given.

Boyd gives the example of a traffic violation: a red light run by someone who is drunk vs by someone who is experiencing a medical emergency. If the latter scenario is not embedded into the model as a specific exception, the algorithm will categorise both events as the same traffic violation. The crucial difference in decision-making between humans and algorithms is that humans can make a judgement based on a combination of factors such as regulations, use cases, guidelines and, fundamentally, environmental and contextual factors, whereas algorithms still have a hard time mimicking the nature of human understanding. Human understanding is fluid and circular, whilst algorithms are linear and rigid. Furthermore, the data-sets on which computational decision-making models are based are inevitably biased, incomplete and far from accurate, because they stem from the very same unequal, racist, sexist and biased systems and procedures that the introduction of computational decision-making was intended to prevent in the first place.
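
A minimal sketch, with hypothetical rules and fields rather than any real system's logic, of how such a model flattens context: whatever was never rendered into the model's input simply cannot affect its output.

```python
from dataclasses import dataclass

@dataclass
class TrafficEvent:
    ran_red_light: bool
    speed_over_limit: int  # km/h over the posted limit

def classify(event: TrafficEvent) -> str:
    # The model sees only two fields; a medical emergency is not
    # representable here, so it cannot influence the outcome.
    if event.ran_red_light:
        return "violation: red light"
    if event.speed_over_limit > 0:
        return "violation: speeding"
    return "no violation"

# The drunk driver and the medical emergency produce identical inputs,
# so they necessarily receive the identical label.
drunk_driver = TrafficEvent(ran_red_light=True, speed_over_limit=0)
medical_emergency = TrafficEvent(ran_red_light=True, speed_over_limit=0)
print(classify(drunk_driver))       # violation: red light
print(classify(medical_emergency))  # violation: red light
```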

Moreover, systems become increasingly complex, and what might be perceived as one algorithm may in fact be many. Indeed, some systems can reach a level of complexity so deep that understanding the intricacies and processes by which the algorithms perform their assigned tasks becomes problematic at best, if at all possible (Gillespie 2014). Although this may not always have serious consequences, it is nevertheless worthy of close scrutiny, especially because complex ML algorithms are today used extensively, and increasingly in systems that operate fundamental social functions such as the already cited healthcare and law and order, while still being 'poorly understood and under-theorized' (Boyd 2018). Despite being assumed, and often advertised, to be neutral, fair and accurate, each algorithm within these complex systems is in fact built according to a set of assumptions and cultural values that reflect the strategic choices made by its creators according to specific logics, be they corporate or institutional.

Another largely distorted view in digital and algorithmic discourse concerns data. Although algorithms and data are often thought of as two distinct entities independent of each other, they are in fact two sides of the same coin. To fully understand why an algorithm operates the way it does, one needs to look at it in combination with the data it uses, better yet at how the data must be prepared for the algorithm to function (Gillespie 2014). This is because, in order for algorithms to work properly, that is automatically, information needs to be *rendered* into data, i.e., formalised according to categories that will define the database records. This act of categorising is precisely where human intervention hides. Gillespie pointedly remarks that, far from being a neutral and unbiased operation, categorisation is in fact 'a powerful semantic and political intervention' (Gillespie 2014, 171): deciding what the categories are, what belongs in a category and what does not are all powerful worldview assertions. Database design can therefore have potentially enormous sociological implications which to date have largely been overlooked (ibid.).

A recent example of the larger repercussions of these powerful worldview assertions concerns fashion companies serving people with disabilities, whose requests to advertise on Facebook have been systematically rejected by Facebook's automated advertising centre. Again, the reason for the rejection is unlikely to have anything to do with intentional discrimination against people with disabilities; it is to be found in the way fashion products for people with disabilities are identified (or rather misidentified) by the Facebook algorithms that determine products' compliance with Facebook policy. Specifically, these items were categorised as 'medical and health care products and services including medical devices' and, as such, they violated Facebook's commercial policy (Friedman 2021). Although these companies had their ads approved after appealing Facebook's decision, episodes like this one reveal not only the deep cracks in ML models but, worse, the strong biases in society at large. To paraphrase Kate Crawford, every classification system in machine learning contains a worldview (Crawford 2021). In this particular case, the implicit bias in Facebook's database worldview is that a person with a disability is not thought to have any interest in fashion as a form of self-expression.
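
A minimal sketch, with hypothetical categories and a hypothetical policy rule rather than Facebook's actual system, of how a single classification choice propagates into an automated rejection:

```python
BLOCKED_CATEGORIES = {"medical_devices"}  # hypothetical ads-policy rule

def classify_product(name: str) -> str:
    # A crude stand-in for an ML classifier: it encodes the worldview
    # that disability-related items are medical rather than fashion.
    if "adaptive" in name or "prosthetic" in name:
        return "medical_devices"
    return "fashion"

def review_ad(name: str) -> str:
    return ("rejected" if classify_product(name) in BLOCKED_CATEGORIES
            else "approved")

# The adaptive item is rejected not out of intent to discriminate but
# because the category system cannot conceive of it as fashion.
print(review_ad("adaptive denim jacket"))  # rejected
print(review_ad("denim jacket"))           # approved
```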

Despite the growing evidence, as well as statements of acknowledgement—'Raw data' is an oxymoron, as the title of Lisa Gitelman's 2013 edited volume has it (Gitelman 2013)—in most public and academic discourse data continues to be exalted as exact and unarguable, still mostly thought of as a natural resource rather than a cultural, situated one. On the contrary, it is precisely the uncritical use of data to make predictions in matters of welfare, homelessness, crime and child protection, to name but a few, which has created systems that are, in Virginia Eubanks' words, 'Automating Inequality' (2017). The immediate, profound and dangerous consequence of the indiscriminate use of automated systems is that the resulting decisions are remorselessly blamed on the targeted individual and justified morally through the legitimisation of practices believed to be evidence-based, therefore accurate and unbiased. This is what Boyd calls the 'dislocation of liability' (2016, 232), whereby decision-makers are distanced from the humanity of those affected by automated procedures.

In this book, I advance a critique of the mainstream big data and algorithmic discourse, which continues to fetishise data as impartial and somehow pre-existing and which obscures the subjective and interpretative dimension of collecting, selecting, categorising and aggregating, i.e., the act of *making* data. I argue that, following the shift *in* the digital rapidly accelerated by the pandemic, a new set of notions, practices and values needs to be devised in order to re-figure the way in which we conceptualise data, technology, digital objects and, on the whole, the process of digital knowledge creation. To this end, drawing on posthumanist studies (Braidotti 2017; Braidotti and Fuller 2019; Braidotti 2019) and on recent theories of digital cultural heritage (Cameron 2021), I present a novel framework: the post-authentic framework. With this framework, I propose concepts, practices and values that recognise the larger cultural relevance of digital objects and of the methods used to create, analyse and visualise them. Significantly, the post-authentic framework problematises digital objects as unfinished, situated processes and acknowledges the limitations, biases and incompleteness of the tools and methods adopted for their analysis in the process of digital knowledge creation. In this way, the framework ultimately introduces a counterbalancing narrative to the main positivist discourse, which equates the removal of the human—illusory in any case—with the removal of biases. Indeed, as the promises of a newly found freedom from subjectivity are increasingly found to be false, the post-authentic framework acts as a reminder that, in our own time, computational technology is like the Mechanical Turk of that earlier century.

Featuring a range of personal case studies and exploring a variety of applied contexts such as digital heritage practices, digital linguistic injustice, critical digital literacy and critical digital visualisation, I devote specific attention to four key aspects of knowledge creation in the digital: the creation, enrichment, analysis and visualisation of digital material. My intention is to show how contributions towards systemic change in research, and by extension in society at large, can be made when collecting, assessing, reviewing, enriching, analysing and visualising digital material. Throughout the chapters, I use the post-authentic framework to discuss these case examples and to show that it is only through conscious awareness of the delusional belief in the neutrality of data, tools, methods, algorithms, infrastructures and processes (i.e., by acknowledging the human chess master hiding inside the Turk) that the embedded biases can be identified and addressed.

My argument is closely related to the notion of 'originary technicity' (see, for instance, Heidegger 1977; Clark 1992; Derrida 1994; Beardsworth 1996; Stiegler 1998), which rejects the Aristotelian view of technology as merely utilitarian. Originary technicity claims that technology is not simply a tool that humans deploy for their own ends, because humans are always invested in the technology they develop. In this way, technology (e.g., AI and algorithms) becomes in turn a central node of knowledge and culture production, and the knowledge and culture so produced shape humans and their vision of the world in a mutually reinforcing cycle: culture is incorporated in technology, as technology is built by humans, who then use technology to produce culture. Hence, as the very concept of absolute objectivity when adopting computational techniques (or in general, for that matter) is an illusion, so are the notions of 'fully autonomous' or 'completely unbiased' processes. An uncritical approach to the use of computational methods, I maintain, not only reinforces the very schemes of obscure practices that digital technology claims to break but, more importantly, can make society worse.

This is a reality that can no longer be ignored and that can only be confronted through a reconfiguration of our model of knowledge creation. This re-examination would relinquish illusory positivistic notions and acknowledge digital processes as situated and partial, as an extremely convoluted assemblage of components which are themselves part of wider networks of other entities, processes and mechanisms of interaction. Broadly, the argument I advance is that the current model of knowledge must be re-figured to incorporate this critical awareness, ever more necessary in order to address the new challenges brought by the pandemic and the digital transformation of society. The shift *in* the digital has created a complexity that a model of knowledge supporting divisive positions (i.e., disciplines that are digital and therefore believed to be objective on one side, and disciplines that are non-digital and therefore biased on the other) cannot address.

I start my argument for an urgent knowledge reconceptualisation by building upon posthuman critical theory (Braidotti 2017), which argues that matter 'is not organized in terms of dualistic mind/body oppositions, but rather as materially embedded and embodied subjects-in-process' (16). In this regard, posthuman critical theory introduces the helpful notion of *monism* (*cfr.* Chap. 2), in which the power of differences is not denied but is at the same time not structured according to principles of opposition, and therefore does not function hierarchically (ibid.). A model of knowledge *in* the digital equally abandons the dichotomous ideas that continue to lie at the foundation of our conceptualisation of knowledge formation, such as digital vs non-digital positions, critical vs technological and, the biggest of all, the sciences vs the humanities.

## 1.3 A TALE OF TWO CULTURES

The hyper-specialisation of research that a discipline-based model of knowledge creation inevitably entails, and how such a solid structure impedes rather than advances knowledge, has been debated in the academic forum for years (e.g., Klein 1983; Thompson Klein 2004; Chubin et al. 1986; Stehr and Weingart 2000; McCarty 2015). As the rigid organisation into disciplines began to dissolve over the course of the twenty-first century, observers started to suggest that the existing model of knowledge production was increasingly inadequate to explain the world and that it was in fact modern society itself that was calling for its reconceptualisation. Weingart and Stehr (2000), for instance, proposed that 'one may have to add a postdisciplinary stage to the predisciplinary stage of the seventeenth and eighteenth centuries and the disciplinary stage of the nineteenth and twentieth centuries' (ibid., xii). At the same time, however, the undeniable amalgamation of disciplines was affecting areas of knowledge unevenly; authors noticed how, for example, in fields such as the natural sciences, with their problem-solving orientation and typically fast knowledge production, boundaries between disciplines were much more blurred than in the humanities (ibid.).

The Digital Turn seemed capable of changing this tradition. The dynamic and disruptive essence of the digital for knowledge creation, and for humanities scholarship in particular, appeared to be correcting this unevenness and making the humanities interdisciplinary. Scholars observed how the digital was not only challenging and transforming structures of knowledge but also creating new ones (e.g., digital humanities, digital history, digital cultural heritage) (Klein 2015; Cameron and Kenderdine 2007; Cameron 2007). The field of DH, it was argued, would in this sense be 'naturally' interdisciplinary, as it provides new methods and approaches which necessarily require new practices and new ways of collaborating. Another 'promise' of DH was that of being able to 'transform the core of the academy by refiguring the labor needed for institutional reformation' (Klein 2015, 15).

After the initial enthusiasm, and despite many examples around the world of interdisciplinary initiatives, academic programmes, departments and centres (Stehr and Weingart 2000; Deegan and McCarty 2011; Klein 2015), in twenty years the rigid division into disciplines has not changed much; it remains the persistent dominant model of knowledge production, and true collaboration is on the whole rare (Deegan and McCarty 2011, 2). Indeed, what these cases of interdisciplinarity show is a common trend: when disciplines share similar interests, rather than boundaries dissolving and merging as interdisciplinary discourse usually claims, what in fact tends to happen is that, in order to respond to new external challenges, disciplines further specialise and, by leveraging their overlapping spaces, create yet new fields. This modern phenomenon has been referred to as 'the paradox of interdisciplinarity' (Weingart 2000):

interdisciplinarity […] is proclaimed, demanded, hailed, and written into funding programs, but at the same time specialization in science goes on unhampered, reflected in the continuous complaint about it. […] The prevailing strategy is to look for niches in uncharted territory, to avoid contradicting knowledge by insisting on disciplinary competence and its boundaries, to denounce knowledge that does not fall into this realm as 'undisciplined.' Thus, in the process of research, new and ever finer structures are constantly created as a result of this behaviour. This is (exceptions notwithstanding) the very essence of the innovation process, but it takes place primarily within disciplines, and it is judged by disciplinary criteria of validation. (Weingart 2000, 26–27)

Weingart argues that from the early nineteenth century, when the separation and specialisation of science into different disciplines took shape, interdisciplinarity became a promise: the promise of the unity of science, to be actualised in the future by reducing the fragmentation into disciplines. Today, however, interdisciplinarity seems to have lost interest in that promise, as the discourse has shifted from the idea of ultimate unity to that of innovation through a combination of variations (ibid., 41). In his essay *Becoming Interdisciplinary*, for example, McCarty (2015) draws a close parallel between the situation of contemporary researchers and the struggle to deal with the overwhelming amount of research available after World War II, which inspired Vannevar Bush's Memex. Bush (1945) maintained that the investigator could not find time to deal with the increasing amount of research, which had grown far beyond anyone's ability to make real use of the record. The difficulty, in his view, was that if on the one hand 'specialization becomes increasingly necessary for progress', on the other 'the effort to bridge between disciplines is correspondingly superficial.' The keyword on which we should focus our attention, McCarty argues, is *superficial* (2015, 73):

Bush's geometrical metaphor (*superficies*, having length or breadth without thickness), though undoubtedly intended as merely a common adjective, makes the point elaborated in another context by Richard Rorty (2004/2002): that the implicit model of knowledge at work here privileges singular truth at depth, reached by the increasingly narrower focus of disciplinary specialization, and correspondingly trivializes plenitude on the surface, and so the bridging of disciplines.

According to Rorty, being interdisciplinary does not mean looking for the one answer but going *superficial*, i.e., wide, to collect multiple voices and multiple perspectives (2004). It has been argued, however, that true collaboration requires a more fundamental shift in the way knowledge creation is conceived than simply studying a common question or problem from different perspectives (van den Besselaar and Heimeriks 2001; Deegan and McCarty 2011). This would also include a deep understanding of disciplines and approaches other than one's own (Gooding 2020). Indeed, the contemporary notion of interdisciplinarity, based on the idea that innovation is better achieved by recombining 'bits of knowledge from previously different fields' into novel fields, is bound to create more specialisation and therefore new boundaries (Weingart 2000, 40).

The schism of the humanities between 'mainstream humanities' and digital humanities, and later between digital humanities and critical digital humanities, perfectly illustrates the issue. In 2012, Alan Liu wrote a provocative essay titled *Where Is Cultural Criticism in the Digital Humanities?* (Liu 2012). The essay was essentially a plea for DH to embrace a wider engagement with the societal impact of technology. It was very much the author's hope that the plea would help convert this 'deficit' into 'an opportunity': the opportunity for DH to gain long-overdue full leadership, as opposed to a 'servant' role, within the humanities. In other words, if DH wanted finally to be recognised as a legitimate partner of 'mainstream humanities', it needed to incorporate cultural criticism into its practices and stop pushing buttons without reflecting on the power of technology.

In the aftermath of Liu's essay, reactions varied greatly, with views ranging from even harsher accusations towards DH to more optimistic perspectives, and some offering fully programmatic and epistemological reflections. Some scholars, for example, voiced strong concerns about the wider ramifications of the lack of cultural critique in DH: what has often been referred to as 'the dark side of the digital humanities' (Grusin 2014; Chun et al. 2016), the association of DH with the 'corporatist restructuring of the humanities' (Weiskott 2017), with neoliberalism (Allington et al. 2016) and with white, middle-class, male dominance (Bianco 2012). Two controversial essays in particular, one published in 2016 by Allington et al. (op. cit.) and the other a year later by Brennan (2017), argued that, in a little over a decade, the myopic focus of DH on neoliberal tooling and distant reading had accomplished nothing but consistently pushing aside what has always been the primary locus of humanities investigation: intellectual practice.

This view was also echoed by Grimshaw (2018), who indicted DH for going to bed with digital capitalism, 'an online culture that is antidiversity and enriching a tiny group of predominantly young white men' (2). Unlike Weiskott (2017), however, who argued that 'There is no such thing as "the digital humanities"', meaning that DH is merely an opportunistic investment and a marketing ploy that does not really alter the core of the humanities, Grimshaw maintained that this kind of pandering causes rot at the heart of humanistic knowledge and practice. This he calls 'functionalist DH': the use of tools to produce information in line with managerial metrics but with no significant knowledge value (6). Grimshaw strongly criticises DH for having betrayed the promise of being a new discipline of emancipation and for being in fact 'nothing more than a tool for oppression'. The digital transformation of society, he continues, has resulted in increased inequality, a wider economic gap, an upsurge in monopolies and surveillance, a lack of transparency around big data, mobbing, trolling, online hate speech and misogyny. Rather than resisting this culture, DH is guilty of having embraced it, of operating within the framework of lucrative tech deals which perpetuate and reinforce the neoliberal establishment. Digital humanists are establishment curators no longer capable of critical thought; DH is therefore totally unequipped to rethink and criticise digital capitalism. Although he acknowledges the emergence of critical voices within DH, he strongly advocates a more radical approach, which would then justify the need for a 'new' field, an additional space within the university where critique, opposition and resistance can happen (7). This space of resistance and critical engagement with digital capitalism is, he proposes, critical digital humanities (CDH).

Over the years, other authors such as Hitchcock (2013), Berry (2014) and Dobson (2019) have also advocated critical engagement with the digital as the epistemological imperative for digital humanists and have identified CDH as the proper locus for such engagement. According to Hitchcock, for example, humanists who use digital technology must 'confront the digital', meaning that they must reflect on the contextual, theoretical and philosophical aspects of the digital. For Berry, CDH practice would allow digital humanists to explore the relationship between critical theory and the digital, and it would be both research- and practice-led. Equally, for Dobson, digital humanists must endlessly question the cultural dimension and historical determination of the technical processes behind digital operations and tools. With perhaps the sole exception of Grimshaw (op. cit.), who is not interested in practice-led digital enquiry, the general consensus is on the urgency of conducting critically engaged digital work, that is, work drawing on the very essence of the humanities: its intrinsic capacity to *critique*.

However, whilst these proposed methodologies do not differ dramatically across authors, there seems to be disagreement about the scope of the enquiry itself. In other words, the open question around CDH would not concern so much the *how* (nor the *why*) but the *what for?*. For example, Dobson (2019) is not interested in a critical engagement with the digital that aims to validate results; this would be a pointless exercise as the distinction between the subjectivity of an interpretative method and the objectivity of both data and computational methods is illusory. He claims (ibid., 46):


there is no such thing as contextless quantitative data. […] Data are imagined, collected, and then typically segmented. […] We should doubt any attempt to claim objectivity based on the notion of bypassed subjectivity because human subjectivity lurks within all data. This is because data do not merely exist in the world, but are abstractions imagined and generated by humans. Not only that, but there always remain some criteria informing the selection of any quantity of data. This act of selection, the drawing of boundaries that names certain objects a data-set introduces the taint of the human and subjectivity into supposedly raw, untouched data.

As 'There is no such thing as the "unsupervised"' (ibid., 45), the aim of CDH is to thoroughly critique any claimed objectivity of computational tools and methods, to be suspicious of presumed human-free approaches and to acknowledge that complete de-subjectification is impossible. The aim of CDH, he argues, is not to expand the set of questions in DH, as in Berry and Fagerjord's view (2017), but to challenge the very notion of a completely objective approach. In this sense, CDH is the endless search for a methodology, the very essence of humanistic enquiry.

Berry (2014) also starts from the assumption that the notion of objective data is illusory; however, he reaches opposite conclusions about the aim of CDH. For him and Fagerjord (2017), CDH would provide researchers with a space for conducting technologically engaged work, that is, work that uses technology but also draws on a vast range of theoretical approaches (e.g., software studies, critical code studies, cultural/critical political economy, media and cultural studies). This would allow scholars from many critical disciplines to tackle issues such as the historical context of any technology used and its theoretical limitations, including, for instance, a commitment to its political dimension. By doing so, CDH would address the criticism about the lack of cultural critique in DH and would enrich DH with other forms of scholarly work (ibid., 175). In other words, by 'fixing' the field's lack of critical engagement, the function of CDH would be to strengthen DH, thus markedly diverging from Dobson.

Albeit from different epistemological points of view, these reflections share similar methodological and ethical concerns and question DH's lack of critical engagement, be it historical, cultural or political. I argue, however, that this reasoning exposes at least three inconsistencies. Firstly, in earlier perspectives (e.g., Liu 2012), the sciences are deemed to be *obviously* superior to the humanities and yet, as soon as the computational is incorporated into the field, the value of the humanities seems to have decreased rather than increased. For example, Bianco (2012) advocates a change in the way digital humanists 'legitimise' and 'institutionalise' the adoption of computational practices in the humanities. Such change would require not simply defending the legitimacy or advocating the 'obvious' supremacy of computational practices but reinvesting in the word *humanities* in DH. The supremacy of the digital would then be understood as a combination of the superiority, dominance and relevance that computational practices—and by extension the hard sciences (i.e., physics envy)—are believed to have over the humanities. However, as Grimshaw (2018) also argued later, in the process of incorporating the computational into their practices, the humanities forgot all about questions of power, domination, myth and exploitation, becoming less and less like the humanities and more and more like a field of execute-button pushers. Despite acknowledging that freedom from subjectivity is an illusion, this view shows how deeply rooted in the collective unconscious is the myth surrounding technology and science, which firmly positions them as detached from human agency and distinctly separate from the humanities.

Secondly, and following from the first point, these views all share a persistent dualistic, oppositional notion of knowledge which, in one form or another, under the disguise of either freshly coined or well-seasoned terms, continues to reflect what Snow famously called 'the two cultures' of the humanities and the sciences (2013). Such separation is typically verbalised in competing concepts such as subjectivity vs objectivity, interpretative vs analytical and critical vs digital. Despite using terms that would suggest union (e.g., 'incorporated'), the two cultures remain clearly divided. This conceptualisation of knowledge creation, which continues to compartmentalise fields and disciplines, I argue, is also reflected in the clear division between the humanities, DH and CDH. This model, I contend, is highly problematic because, besides promoting intense schism, it inevitably leads disciplines to operate within a hierarchical, competitive structure in which they are far from equal. For example, Liu's critique mirrors the persistent dichotomy of science vs humanities: due to its lack of cultural criticism—typical of the sciences but not of the humanities—DH is not humanities at all. DH may be instrumental to the humanities (i.e., the humanities is superior to DH but inferior to the sciences), but it is reduced to a servant role. Hence, if typical descriptions of DH as a space in which the two worlds—the sciences and the humanities—'meld' seem at first to suggest a harmonious and egalitarian coexistence, in reality the way this relationship plays out is anything but.

The third contradiction refers to what Berry and Fagerjord (2017) (among others) point out in reference to the digital transformation of society: that 'The question of whether something is or is not "digital" will be increasingly secondary as many forms of culture become mediated, produced, accessed, distributed or consumed through digital devices and technologies' (13). Humanists, they claim, must relinquish any comparative notion of digital vs analogue, as this contrast 'no longer makes sense' (ibid., 28). What humanists need to do instead, they continue, is to reflect critically on the computational and its ramifications in a dedicated space which, like Grimshaw and Dobson, they also suggest calling CDH, thus circling back to the second contradiction. If the humanities are critical and if the distinction between digital and analogue 'no longer makes sense', then by insisting on establishing a CDH they fail to transcend the very distinction between digital and analogue that they claim to be nonsensical.

While I see the validity and truth in the debates that have animated past DH scholarship, I also argue that the reason for these inconsistencies is to be found in the specific model of knowledge within which these scholars still operate: a model in which knowledge is divided into competing disciplines. Behind the pushes to relinquish ideas of divisions and embrace the digital is a persistent disciplinary structure of knowledge which, despite the declared novelty, is bound to the epistemology of the last century. Instead, I maintain, we should not accommodate the digital within the existing disciplinary structure, as it is the structure of knowledge itself and its conceptualisation into separate fields and worldviews that has to change. The current model of knowledge creation, grounded in division and competition, is unequipped to explain the complexities of the world, and the 2020 pandemic has magnified the urgency of adopting a strong critical stance on the digital transformation of society. This cannot happen through the creation of niche fields, let alone exclusively within the humanities, but through a reconceptualisation of knowledge creation itself.

The post-authentic framework that I propose in this book moves beyond the existing breakdown of disciplines which I see as not only unhelpful and conceptually limiting but also harmful. The main argument of this book is that it is no longer solely the question of how the digital affects the humanities but how knowledge creation more broadly happens in the digital. Thinking in terms of yet another field (e.g., CDH) where supposedly computational science and critical enquiry would meet in this or that modulation, for this or that goal, still reiterates the same boundaries that hinder that enquiry. Similarly, claiming that DH scholarship conducts digital enquiry suggests that humanities scholarship does not happen in the digital and therefore it continually reproduces the outmoded distinction between digital and analogue as well as the dichotomy between digital/non-critical and non-digital/critical. Conversely, calls for a CDH presuppose that DH is never critical (or worse, that it cannot be critical at all) and that the humanities can (should?) continue to defer their appointment with the digital, and disregard any matter of concern that has to do with it, ultimately implying that to remain unconcerned by the digital is still possible.

But the digital affects us all, including (perhaps especially) those who do not have access to it. The digital transformation exacerbates already existing inequalities in society, as those who are most vulnerable, such as migrants, refugees, internally displaced persons, older persons, young people, children, women, persons with disabilities, rural populations and indigenous peoples, are disproportionately affected by the lack of digital access. The digital lens provided by the 2020 pandemic has therefore magnified the inequality and unfairness that are deeply rooted in our societies. In this respect, for example, on 18 July 2020, UN Secretary-General António Guterres declared (United Nations 2020a):

COVID-19 has been likened to an x-ray, revealing fractures in the fragile skeleton of the societies we have built. It is exposing fallacies and falsehoods everywhere: the lie that free markets can deliver healthcare for all; the fiction that unpaid care work is not work; the delusion that we live in a post-racist world; the myth that we are all in the same boat. While we are all floating on the same sea, it's clear that some are in super yachts, while others are clinging to the drifting debris.

The post-authentic framework that I propose in this book is a conceptual framework for knowledge creation in the digital; it rejects the view of the digital as crossing paths with disciplines, intersecting, melting, merging, meeting or any other verb that suggests that separate entities are converging while leaving the model of knowledge essentially unaffected. I maintain that this sort of worldview is obsolete, even dangerous; researchers can no longer justify statements such as 'I'm not digital', as we are all *in* the digital. But rather than seeing this transformation as a threat, some sort of bleak reality in which critical thinking no longer has a voice and everything is automated, I see it as an opportunity for change of historic proportions. Any process of transformation fundamentally changes all the parts involved; if we accept the notion of digital transformation with regard to society, we also have to acknowledge that, as much as the digital transforms society, the way society produces knowledge must also be transformed. This entails acknowledging the unsuitability of current frameworks of knowledge creation for understanding the deep implications of technology for culture and knowledge and for meeting the world's challenges, complexified by the digital. This book signals how the digital acceleration brought by the events of 2020 adds new urgency to an issue already identified by scholars some twenty years ago but which can no longer be postponed. Hall, for instance, argued (2002, 128):

We cannot rely merely on the modern "disciplinary" methods and frameworks of knowledge in order to think and interpret the transformative effect new technology is having on our culture, since it is precisely these methods and frameworks that new technology requires us to rethink.

I therefore suggest we stop using the term 'interdisciplinarity' altogether. As it contains the word *discipline*, albeit in reference to breaking, crossing, transcending disciplines' boundaries and all the other usual suspects that typically recur in interdisciplinarity discourse, I believe that the term continues to refer to the exact same notions of knowledge compartmentalisation that the digital transformation requires us to relinquish. In my view, thinking in these terms is not helpful and does not adequately respond to the consequences of the digital transformation that society, higher education and research have undergone. Based on separateness and individualism, the current model of knowledge creation restricts our ability to identify and access the various complexities of reality. Traditional binary views of deep/significant vs superficial/trivial, digital/non-critical vs non-digital/critical and the sciences vs the humanities may appear firm, but only because we exaggerate their fixity. Similarly, the separation into disciplines may seem inevitable and fixed, but in reality the majority of norms and views are arbitrary, neither unavoidable nor final and, therefore, completely alterable. Weingart, for instance, states (Weingart 2000, 39):

The structures are by no means fixed and irreplaceable, but they are social constructs, products of long and complex social interactions, subject to social processes that involve vested interests, argumentation, modes of conviction, and differential perceptions and communications.

With specific reference to the current model of knowledge creation, for example, Stichweh (2001) reminds us that the organisation of universities into academic departments is a rather recent phenomenon, 'an invention of nineteenth century society' (13727); in fact, to paraphrase McKeon, the apparently monolithic integrity of disciplines as we know them may sometimes obscure a radically disparate and interdisciplinary core (1994). The argument I reiterate in this book is that the current landscape requires us to move on from this model, beyond (not away from) thick description of single-discipline case studies, and to recognise not only that knowledge is much more fluid than we are accustomed to think, but also that the digital transcends artificial discipline boundaries.

In the chapters that follow, I take an auto-ethnographic and self-reflexive approach to show how the application of the post-authentic framework I have developed has informed my practice as a humanist *in* the digital. More broadly, I show how the framework can guide a conceptualisation of knowledge creation that transcends discipline boundaries, especially digital vs non-digital positions. Thinking in terms of *in* the digital—and no longer *and* the digital—thus bears enormous potential for tangibly undisciplining knowledge, for introducing counter-narratives into the digital capitalistic discourse, for developing, encouraging and spreading a digital conscience and for taking an active part in the re-imagination of post-authentic higher education and research. The world has entered a new dimension in which knowledge can no longer afford to see technology and its production simply as instrumental and contextual or as an object of critique, admiration, fear or envy. In my view, the current landscape is much more complex and has far wider implications than those identified so far. In this book, I want to elaborate on them, not to reject previous positions but to provide additional perspectives which I think are urgently required, especially as a consequence of the 2020 pandemic.

In what is still a predominantly binary conceptual framework, e.g., the sciences vs the humanities, the humanities vs DH and DH vs CDH, this book provides a third way: knowledge creation in the digital. The book argues that the new paradigm shift *in* the digital—as opposed to *towards* it—accelerated considerably by the COVID-19 pandemic, positions knowledge creation beyond such outdated dichotomous conceptualisations. We develop technology at a blistering pace, but our capacity to misuse it, abuse it and do harm grows just as fast. It is therefore everyone's duty to argue against any claimed computational neutrality but, more importantly, to relinquish outmoded and rather presumptuous perspectives that grant humanists alone the moral monopoly on criticism and critique. Indeed, as we are all in the digital, critical engagement cannot afford to remain limited to a handful of scholars who may or may not have an interest in practice-led digital research—but who are in the digital nevertheless—as this would tragically create more fragmentation, polarisation and, ultimately, harm.

This is not a book about CDH, neither is it a book about DH, nor is it about the digital *and* the humanities or the digital *in* the humanities. What this book is about is knowledge *in* the digital.

## 1.4 OH, THE PLACES YOU'LL GO!

The digital transformation of society—and therefore of academia and of knowledge creation more generally—will not be stopped, let alone reversed. The claim I advance in this book is that, whilst a great deal of talk has so far revolved around the impact of the digital on individual fields, how the model of knowledge creation should be transformed accordingly has largely been overlooked. I argue that the increasing complexity of the world brought about by the digital transformation now demands a new model of knowledge to understand, explain and respond to the reality of ubiquitous digital data, algorithmic automated processes, computational infrastructures, digital platforms and digital objects. I contend that such engagement should not unfold as coming from a place of criticism per se but that it should be seized as a historic opportunity for truly decompartmentalising knowledge and reconfiguring the way we think about it. A decompartmentalised model of knowledge does not denature disciplines, but it breaks the current oppositional, hierarchical structure in which disciplines still operate. The digital transformation finally forces us to go back to the fundamental questions: how do we create knowledge and how do we want to train our next generation of students?

Be it in the form of data, platforms, infrastructures or tools, scholars across the humanities have pointed out the interfering nature of the digital at different levels and have called for a reconfiguration of how research practice is conceptualised (e.g., Cameron and Kenderdine 2007; Drucker 2011, 2020; Braidotti 2019; Cameron 2021; Fickers 2022). Fickers, for instance, proposes digital hermeneutics as a helpful framework to address both the archival and historiographical issues 'raised by changing logics of storage, new heuristics of retrieval, and methods of analysis and interpretation of digitized data' (2020, 161). In this sense, the digital hermeneutics framework combines critical reflection on historical practice with digital literacy, for instance by embedding digital source criticism, a reflection on the consequences for the epistemology of history of the transformation of sources into data through digitisation.

With specific reference to cultural heritage concepts and their relation to the digital, Cameron (2007; 2021) refigures digital cultural heritage curation practices and digital museology by problematising digital cultural heritage as societal data, entities with their own forms of agency, intelligence and cognition (Cameron 2021). By reflecting on the wider consequences of the digital on heritage for future generations, including Western perspectives, climate change, environmental destruction and injustice, the scholar proposes a more-than-human digital museology framework which recognises the impact of AI, automated systems and infrastructures as part of a wider ecology of components in digital cultural heritage practices.

On the mediating role of the digital in the visual representation of material destined for humanistic enquiry, Drucker (2004; 2011; 2013; 2014; 2020) has also long advocated a critical stance and a more problematised approach. She has, for example, proposed alternative ways of visualising digital material that expose rather than hide the different stages of mediation, interpretation, selection and categorisation that typically disappear in the final graphical display. Her work introduces an important counter-narrative into the public and academic discourse which predominantly exalts data, computational processes and digital visualisations as unarguable and exact.

These contributions are all unmistakable signs of the decreasing relevance of the current model of knowledge production following the digital transformation of society, and of the fact that the notion of the digital as something that 'happens' to knowledge creation is by now entirely anachronistic. At the same time, however, these approaches insist on disciplinary competence and are indeed modulated primarily within, and for, the fields and disciplines they originate from (e.g., digital history, digital cultural heritage, the humanities). The post-authentic framework that I propose here attempts to break with the 'paradox of interdisciplinarity' in relation to the digital, whereby knowledge is not truly undisciplined but the digital is incorporated into existing fields and creates yet new fields, hence new boundaries. The post-authentic framework incorporates all these recent perspectives but at the same time goes beyond them; as it intentionally refers to digital objects rather than to the disciplines within which they are created, it provides an architecture for issues such as transparency, replicability, Open Access, sustainability, accountability and visual display with no specific reference to any discipline.

I build my argument for applying the post-authentic framework to digital knowledge creation and digital objects upon recent theories of critical posthumanities (Braidotti 2017; Braidotti and Fuller 2019). In recognising that current terminologies and methods for posthuman knowledge production are inadequate, critical posthumanities offers a more holistic perspective on knowledge creation, and it is therefore particularly relevant to the argument I advance in this book. With specific reference to the need for novel notions that may guide a reconceptualisation of knowledge creation, Braidotti and Fuller (Braidotti 2017; Braidotti and Fuller 2019) propose *Transversal Posthumanities*, a theoretical framework for the Critical Posthumanities. With this framework, they introduce the concept of *transversality*, a term borrowed from geometry that refers to the understanding of spaces in terms of their intersection (Braidotti and Fuller 2019, 1). Although the main argument I advance in this book is also that of an urgent need for knowledge reconfiguration, I maintain that *transversality* still suggests a view of knowledge as solid, and thus it only partially breaks with the outdated conceptualisation of discipline compartmentalisation that it aims to relinquish. To actualise a remodelling of knowledge, I introduce two concepts: *symbiosis* and *mutualism*. In Chap. 2, I explain how the notion of symbiosis—from the Greek for 'living together'—embeds in itself the principle of knowledge as fluid and inseparable. Similarly, borrowed from biology, the notion of mutualism proposes that areas of knowledge do not compete against each other but benefit from a mutually compensating relationship. Building on the notion of *monism* in posthuman theory (Braidotti and Fuller 2019, 16) (*cfr.* Sect. 1.2), in which differences are not denied but at the same time do not function hierarchically, symbiosis and mutualism help refigure our understanding of knowledge creation not as a space of conflict and competition but as a space of fluid interactions in which differences are understood as mutually enriching.

Symbiosis and mutualism are central concepts of the post-authentic framework that I propose in this book, a theoretical framework for knowledge creation *in* the digital. If collaboration across areas of knowledge has so far been largely an option, often motivated more by a grant-seeking logic than by genuine curiosity, the digital calls for an actual change in knowledge culture. The question we should ask ourselves is not 'How can we *collaborate*?' but 'How can we *contribute* to each other?'. Concepts such as those of symbiosis and mutualism could equally inform our answer when asking ourselves the question 'How do we want to create knowledge and how do we want to train our next generation of students?'.

To answer this question, the post-authentic framework starts by reconceptualising digital objects as much more complex entities than mere collections of data points. Digital objects are understood as the conflation of humans, entities and processes connected to each other according to the various forms of power embedded in computational processes and beyond, and which therefore bear consequences (Cameron 2021). As such, digital objects transcend traditional questions of authenticity, because digital objects are never finished, nor can they be finished: countless versions can continuously be created through processes that are shaped by past actions and in turn shape the following ones. Thus, in the post-authentic framework, the emphasis is on both products and processes, which are acknowledged as never neutral and as incorporating external, situated systems of interpretation and management. Specifically, I take digitised cultural heritage material as an illustrative case of a digital object and demonstrate how the post-authentic framework can be applied to knowledge creation in the digital. Throughout the chapters of this book, I devote specific attention to four key aspects of knowledge creation *in* the digital: creation of digital material in Chap. 2, enrichment of digital material in Chap. 3, analysis of digital material in Chap. 4, and visualisation of digital material in Chap. 5.

The second content chapter, Chap. 3, focuses on the application of the post-authentic framework to the task of enriching digital material; I use DeXTER (DeepteXTminER) and *ChroniclItaly 3.0* (Viola and Fiscarelli 2021a) as case examples. DeXTER is a workflow that implements deep learning techniques to contextually augment digital textual material; *ChroniclItaly 3.0* is a digital heritage collection of Italian American newspapers published in the United States between 1898 and 1936. In the chapter, I show how symbiosis and mutualism have guided each action of DeXTER's enrichment workflow, from pre-processing to data augmentation. My aim is to exemplify how the post-authentic framework can guide interaction with the digital not as a strategic (grant-oriented) or instrumental (task-oriented) collaboration but as a cognitive mutual *contribution*. I end the chapter by arguing that the task of augmenting information of cultural heritage material carries the responsibility of building a source of knowledge for current and future generations. In particular, the use of methods such as named entity recognition (NER), geolocation and sentiment analysis (SA) requires a thorough understanding of the assumptions behind these techniques, constant updating and critical supervision. In the chapter, I specifically discuss the ambiguities and uncertainties of these methods and I show how the post-authentic framework can help address these challenges.
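To give a concrete sense of why such techniques demand critical supervision, the following minimal sketch shows a typical sentiment analysis call. It is an illustration only: the pipeline loads Hugging Face's generic default English checkpoint, not necessarily the model used in DeXTER, and the example sentence is invented. The single confidence score it returns conceals exactly the kind of ambiguity discussed above.

```python
# A minimal sentiment analysis (SA) sketch with the transformers library.
# The default pretrained checkpoint is an assumption for illustration;
# different models will return different labels and scores for the same text.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("The community welcomed the new arrivals with open arms.")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```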

In Chap. 4, I illustrate how the post-authentic framework can be applied to the analysis of a digital object through the example of topic modelling, a distant reading method born in computer science and widely used in the humanities to mine large textual repositories. In particular, I highlight how, through a deep understanding of the assemblage of culture and technology in the software, the post-authentic framework can guide us towards exploring, questioning and challenging the interpretative potential of computation. Drawing on the mathematical concepts of discrete vs continuous modelling of information, in the chapter I reflect on the implications for knowledge creation of the transformation of continuous material into discrete form, binary sequences of 0s and 1s, and I focus especially on the notions of causality and correlation. I then illustrate the example of topic modelling as a computational technique that treats continuous material such as a collection of texts as discrete data. I bring critical attention to problematic aspects of topic modelling that are highly dependent on the sources: pre-processing, corpus preparation and deciding on the number of topics. The topic modelling example ultimately shows how post-authentic knowledge creation can be achieved through a sustained engagement with software, also in the form of a continuous exchange between processes and sources. Guided by symbiosis and mutualism, such dialogue maintains the interconnection between two parallel goals: output—any processed information—and outcome, the value resulting from the output (Patton 2015).
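For readers unfamiliar with the technique, the following minimal sketch shows how topic modelling treats texts as discrete data and how choices such as the stopword list and the number of topics rest with the researcher. The corpus and parameter values are hypothetical, and scikit-learn's LDA implementation stands in for whichever software a project might actually use.

```python
# A minimal topic modelling sketch: texts are discretised into a
# document-term matrix and decomposed into a fixed number of topics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "the steamship arrived in new york from genoa",
    "the parish announced the feast of the patron saint",
    "workers at the mine went on strike for higher wages",
]

# Each choice below -- stopword list, tokenisation, number of topics --
# is an interpretative act that yields a different model of the sources.
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(doc_term)  # documents x topic proportions

vocabulary = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [vocabulary[i] for i in topic.argsort()[-5:]]
    print(f"Topic {idx}: {top_words}")
```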

Operating within the post-authentic framework crucially means acknowledging digital objects as having far-reaching, unpredictable consequences; as the complex pattern of interrelationships among processes and actors continually changes, interventions and processes must always be critically supervised. One such process is the provision of access to digital material through visualisation. In Chap. 5, I argue that the post-authentic framework can help highlight the intrinsically dynamic, situated, interpreted and partial nature of computational processes and digital objects. Thus, whilst appreciating the benefits of visualising digital material, the framework rejects an uncritical adoption of digital methods and opposes the dominant discourse that still presents graphical techniques and outputs as exact, final, unbiased and true. In the chapter, I illustrate how the post-authentic framework can be applied to the visualisation of cultural heritage material by discussing two examples: efforts towards the development of a user interface (UI) for topic modelling and the design choices for developing the app DeXTER, the interactive visualisation interface that explores *ChroniclItaly 3.0*. Specifically, I present work done towards visualising the ambiguities and uncertainties of topic modelling, network analysis (NA) and SA, and I show how key concepts and methods of the post-authentic framework can be applied to digital knowledge visualisation practices. I centre my argument on how the acknowledgement of curatorial practices as manipulative interventions can be encoded in the interface. I end the discussion by arguing that it is in fact through the interface display of the ambiguities and uncertainties of these methods that the active and critical participation of the researcher is acknowledged as required, keeping digital knowledge honest and accountable.
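One minimal way such an acknowledgement can be encoded in an interface, sketched here with invented values, is to display a document's full topic distribution rather than a single 'winning' label, so that the uncertainty of the model remains visible to the user.

```python
# A minimal sketch of visualising topic modelling uncertainty: the full
# probability distribution is shown instead of one definitive topic label.
# Topic names and proportions below are hypothetical.
import matplotlib.pyplot as plt

topics = ["migration", "religion", "labour", "politics"]
proportions = [0.41, 0.32, 0.17, 0.10]  # e.g. one row of a doc-topic matrix

fig, ax = plt.subplots()
ax.barh(topics, proportions)
ax.set_xlabel("Topic proportion")
ax.set_title("Document 17: a distribution, not a single label")
plt.tight_layout()
plt.show()
```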

In the final chapter, Chap. 6, I review the main formulations of this book project and I retrace the key concepts and values at the foundation of the post-authentic framework proposed here. I end the chapter with a few additional propositions for remodelling the process of digital knowledge production that could be adopted to inform the restructuring of academic and higher education programmes.


# The Importance of Being Digital

A perspective is by nature limited. It offers us one single vision of a landscape. Only when complementary views of the same reality combine are we capable of achieving fuller access to the knowledge of things. The more complex the object we are attempting to apprehend, the more important it is to have different sets of eyes, so that these rays of light converge and we can see the One through the many. That is the nature of true vision: it brings together already known points of view and shows others hitherto unknown, allowing us to understand that all are, in actuality, part of the same thing. (Grothendieck 1986)

# 2.1 AUTHENTICITY, COMPLETENESS AND THE DIGITAL

For the past twenty years, digital tools, technologies and infrastructures have played an increasingly determining role in framing how digital objects are understood, preserved, managed, maintained and shared. Even in traditionally object-centred sectors such as cultural heritage, digitisation has become the norm: heritage institutions such as archives, libraries, museums and galleries continuously digitise huge quantities of heritage material. The most official indication of this shift towards the digital in cultural heritage is perhaps provided by UNESCO which, in 2003, recognised that the world's documentary heritage was increasingly produced, distributed, accessed and maintained in digital form; accordingly, it proclaimed digital heritage as common heritage (UNESCO 2003). Unsurprisingly yet significantly, the acknowledgement was made in the context of endangered heritage, including digital, whose conservation and protection must be considered 'an urgent issue of worldwide concern' (ibid.).

The document also officially distinguished between heritage created digitally (from then on referred to as digitally born heritage), that is, heritage for which no format other than the digital object exists, and digitised heritage, heritage 'converted into digital form from existing analogue resources' (UNESCO 2003). Therefore, as per heritage tradition, the semantic motivation behind digitisation was that of preserving cultural resources from feared deterioration or permanent disappearance. It has been argued, however, that by distinguishing between the two types of digital heritage, the UNESCO statement de facto framed the digitisation process as a *heritagising* operation in itself (Cameron 2021). Consequently, to the classic cultural heritage paradigm 'preserved heritage = heritage worth preserving', UNESCO added another layer of complexity: the equation 'digitised = preserved' (ibid.).

UNESCO's acknowledgement of digital heritage, and in particular of digitised heritage as common heritage, has undoubtedly had profound implications for our understanding of heritage practices, material culture and preservation. For example, by officially introducing the digital in relation to heritage, UNESCO's statement deeply affected traditional notions of authenticity, originality, permanent preservation and completeness which have historically been central to heritage conceptualisations. For the purposes of this book, I will simplify the discussion by saying that more traditional positions have insisted on the intrinsic lack of authority of copies, which do not possess what Benjamin famously called the 'aura' of an object (Benjamin 1939). Museum culture has conventionally revolved around these traditional, rigid rules of originality and authenticity, established as *the* values legitimising museums as the only accredited custodians of true knowledge. Historically, such an understanding of heritage has sadly gone hand in hand with a very specific discourse, one dominated by Western perspectives. These views have been based on ideas of old, grandiose sites and objects as being the sole heritage worthy of preservation, which have in turn perpetuated Western narratives of nation, class and science (ACHS 2012).

More recent scholarship, however, has moved away from such object-centred views and reworked conventional conceptualisations of authenticity and completeness in relation to the digital (see for instance, Council on Library and Information Resources 2000; Jones et al. 2018; Goriunova 2019; Zuanni 2020; Cameron 2021; Fickers 2021). From the 1980s onwards, for example, the influence wielded by postmodernist and post-colonial theories challenged these traditional frameworks and brought new perspectives to the conceptualisation of material culture (see for instance, Tilley 1989; Vergo 1989). The key idea of this new approach, and the one most relevant to the arguments advanced in this book, is that material culture does not intrinsically possess any meanings; instead, meanings are ascribed to material culture when interpreting it in the present. As Christopher Y. Tilley famously stated, 'The meaning of the past does not reside in the past, but belongs in the present' (Tilley 1989, 192). According to this perspective, the significance of material culture is not eternal and absolute but continually negotiated in a dialectical relationship with contemporary values and interactions. For example, in disciplines such as museum studies, this view takes the form of a critique of the social and political role of heritage institutions. Through this lens, museums are not seen as neutral custodians of material culture but as grounded in Western ideologies of elitism and power and representing the interests of only a minority of the population (Vergo 1989).

Such considerations have led to the emergence of new disciplines such as *Critical Heritage Studies* (CHS). In CHS, heritage is understood as a continuous negotiation of past and present modularities, in the acknowledgement that heritage values are neither fixed nor universal; rather, they are culturally situated and constantly co-constructed (Harrison 2013). Though still aimed at preserving and managing heritage for future generations, CHS is resolutely concerned with questions of power, inequality and exploitation (Hall 1999; Butler 2007; Winter 2011), thus sharing many foci of interest with critical posthumanities (Braidotti 2019) and intersecting neatly with the post-authentic framework I propose in this book.

The official introduction of the digital into the context of cultural heritage has necessarily become intertwined with the political and ideological legacy concerning traditional notions of original and authentic vs copies and reproductions. Simplistically seen as mere immaterial copies of the original, digital objects could not but severely disrupt these fundamental values, in some cases going as far as being framed as 'terrorists' (Cameron 2007, 51), that is, as destabilising instruments of what is true and real. In an effort to defend material authenticity as the sole element defining meaning, digital artefacts were at best bestowed an inferior status in comparison to the originals, a servant role to the real.

The parallel with DH vs 'mainstream humanities' is hard to miss (*cfr.* Chap. 1). In 2012, Alan Liu defined DH as 'ancillary' to mainstream humanities (Liu 2012), whereas others (e.g., Allington et al. 2016; Brennan 2017) claimed that by incorporating the digital into the humanities, its very essence, namely agency and criticality, was violated, one might say polluted. In opposition to the analogue, the digital was seen as an immaterial, agentless and untrue threatening entity undermining the authority of the original. Similar to digital heritage objects, these criticisms of DH did not problematise the digital but simplistically reduced it to a non-human, uncritical entity.

Nowadays, this view is increasingly challenged by new conceptual dimensions of the digital; for instance Jones et al. (2018) argue that 'a preoccupation with the virtual object—and the binary question of whether it is or is not authentic—obscures the wider work that digital objects do' (Jones et al. 2018, 350). Similarly, in her exploration of the digital subject, Olga Goriunova (2019) reworks the notion of distance in Valla and Benenson's artwork in which a digital artefact is described as 'neither an object nor its representation but a distance between the two' (2014). Far from being a blank void, this distance is described as a 'thick' space in which humans, entities and processes are connected to each other (ibid., 4) according to the various forms of power embedded in computational processes. According to this view, the concept of authenticity is considered in relation to the digital subject, i.e., the digital self, which is rethought as a much more complex entity than just a collection of data points and at the same time, not quite a mere extension of the self. More recently, Cameron (2021) states that in the context of digital cultural heritage, the very conceptualisation of a digital object escapes Western ideas of curation practices, and authenticity 'may not even be something to aspire to' (15).

This chapter expands on these recent positions, not because I disagree with the concepts and themes expressed by these authors, but because I want to add a novel reflection on digital objects, including digital heritage, and on both theory- and practice-oriented aspects of digital knowledge creation more widely. I argue that such aspects are in urgent need of reframing not solely in museum and gallery practices, and heritage policy and management, but crucially also in any context of digital knowledge production and dissemination where an outmoded framework of discipline compartmentalisation persists. Taking digital cultural heritage as an illustrative case of a digital object typical of humanities scholarship, I devote specific attention to the way in which digitisation has been framed and understood and to the wider consequences for our understanding of heritage, memory and knowledge.

# 2.2 DIGITAL CONSEQUENCES

This book challenges traditional notions of authenticity by arguing for a reconceptualisation of the digital as an organic entity embedding past, present and future experiences which are continuously renegotiated during any digital task (Cameron 2021). Specifically, I expand on what Cameron calls the 'ecological composition concept' (ibid., 15) in reference to digital cultural heritage curation practices to include any action in a digital setting, also understood as bearing context and therefore consequences. She argues that the act of digitisation does not merely produce immaterial copies of analogue counterparts—as implied by the 2003 UNESCO statement with reference to digitised cultural heritage—but, by creating digital objects, it creates new things which in turn become alive, and which are therefore themselves subject to renegotiation. I further argue that any digital operation is equally situated, never neutral, as each in turn incorporates external, situated systems of interpretation and management. For example, the digitisation of cultural heritage has been discursively legitimised as a heritagising operation, i.e., an act of preservation of cultural resources from deterioration or disappearance. Though certainly true to an extent, preservation is only one of the many aspects linked to digitisation and far from the only reason why governments and institutions have started to invest massively in it. In line with the wider benefits that digitisation is thought to bring at large (*cfr.* Chap. 1), the digitisation of cultural heritage is believed to serve a range of other, more strategic goals such as fuelling innovation, creating employment opportunities, boosting tourism and enhancing the visibility of cultural sites including museums, libraries and archives, all together leading to economic growth (European Commission 2011).

Inevitably, the process of cultural heritage digitisation itself has therefore become intertwined with questions of power, economic interests, ideological struggles and selection biases. For instance, after about two decades of major, large-scale investments in the digitisation of cultural heritage, self-reported data from cultural heritage institutions indicate that in Europe, only about 20% of heritage material exists in a digital format (Enumerate Observatory 2017), whereas globally, this percentage is believed to remain at 15%. Behind these percentages, it is very hard not to see the colonial ghosts of the past. CHS have problematised heritage designation not just as a magnanimous act of preserving the past, but as 'a symbol of previous societies and cultures' (Evans 2003, 334). When deciding which societies and whose cultures, political and economic interests, power relations and selection biases are never far away. For example, particularly in the first stages of large-scale mass digitisation projects, special collections often became the prioritised material to be digitised (Rumsey and Digital Library Federation 2001), whereas less mainstream works and minority voices tended to be largely excluded. Typically, libraries needed to decide what to digitise based on cost-effectiveness analyses, and so their choices were often skewed by economic imperatives rather than 'actual scholarly value' (Rumsey and Digital Library Federation 2001). The UNESCO-induced paradigm 'digitising = preserving' helped communicate the idea that any digitised material was intrinsically worth preserving, thus in turn perpetuating previous decisions about what had been worth keeping (Crymble 2021).

There is no doubt that today's under-representation of minority voices in digital collections directly mirrors decades of past decisions about what to collect and preserve (Lee 2020). In reference to early US digitisation programmes, for example, Abby Smith Rumsey points out that as a direct consequence of this reasoning:

foreign language materials are nearly always excluded from consideration, even if they are of high research value, because of the limitations of optical character recognition (OCR) software and because they often have a limited number of users. (Rumsey and Digital Library Federation 2001, 6)

This has in turn had other repercussions. As most of the digitised material has been in English, tools and software for exploring and analysing the past have primarily been developed for the English language. Although in recent years greater awareness around issues of power, archival biases, silences in the archives and lack of language diversity within the context of digitisation has certainly developed, not just in archival and heritage studies but also in DH and digital history (see for instance, Risam 2015; Putnam 2016; Earhart 2019; Mandell 2019; McPherson 2019; Noble 2019), the fact remains that most of that 15% is the sad reflection of this bitter legacy.

Another example of the situated nature of digitisation is microfilming. In his famous investigative book *Double Fold*, Nicholson Baker (2002) documents in detail the contextual, economic and political factors surrounding microfilming practices in the United States. Through a zealous investigation, he tells us a story involving microfilm lobbyists, former CIA agents and the destruction of hundreds of thousands of historical newspapers. He pointedly questions the choices of high-profile figures in American librarianship such as Patricia Battin, former Head Librarian of Columbia University and head of the American Commission on Preservation and Access from 1987 to 1994. From the analysis of government records and interviews with persons of interest, Baker argues that Battin and the Commission pitched the mass digitisation of paper records to charitable foundations and the American government by inventing the 'brittle book crisis', the apparently rapid deterioration that was destroying millions of books across America (McNally 2002). In reality, he maintains, her campaign of persuasion was part of an agenda to provide content for microfilming technology.

In advocating for preservation, Baker also discusses the limitations of digitisation and some specific issues with microfilming, such as loss of colour and quality and grayscale saturation. Such issues have had unpredictable consequences over the years, particularly for images. In historical newspapers, some images used to be printed through a technique called rotogravure, a type of intaglio printing known for its good quality image reproduction and especially well suited for capturing details of dark tones. Scholars (e.g., Williams 2019; Lee 2020) have pointed out how the grayscale saturation issue of microfilming directly affects images of Black people as it distorts facial features by achromatising the nuances. In the case of millions and millions of records of images digitised from microfilm holdings, such as the 1.56 million images in the Library of Congress' *Chronicling America* collection, it has been argued that the microfilming process itself has acted as a form of oppression for communities of colour (Williams 2019). This, together with several other criticisms concerning selection biases, has led some authors to talk about *Chronicling White America* (Fagan 2016).

In this book I argue in favour of a more problematised conceptualisation of digital objects and digital knowledge creation as living entities that bear consequences. To build my argument, I draw upon posthuman critical theory, which understands matter as an extremely convoluted assemblage of components, 'complex singularities relate[d] to a multiplicity of forces, entities, and encounters' (Braidotti 2017, 16). Indeed, for its deconstructing and disruptive take, I believe the application of posthumanities theories has great potential for refiguring traditional humanist forms of knowledge. Although I discuss examples of my own research based on digital cultural heritage material, my aim is to offer a counter-narrative beyond cultural heritage and with respect to the digitisation of society. My intention is to challenge the dominant public discourse that continues to depict the digital as non-human, agentless, non-authentic and contextless, and by extension digital knowledge as necessarily non-human, *cultureless* and bias-free. The digitisation of society, sharply accelerated by the COVID-19 pandemic, has added complexity to reality, precipitating processes that have triggered reactions with unpredictable, potentially global consequences. I therefore maintain that with respect to digital objects, digital operations and the way in which we use digital objects to create knowledge, it is the notion of the digital itself that needs reframing. In the next section, I introduce the two concepts that may inform such a radical reconfiguration: *symbiosis* and *mutualism*.

# 2.3 SYMBIOSIS, MUTUALISM AND THE DIGITAL OBJECT

This book recognises the inadequacy of the traditional model of knowledge creation, but it also contends that the pervasive digitisation induced by the 2020 pandemic has added further urgency, to the point that this change can no longer be deferred. Such a refigured model, I argue, must conceptualise the digital object as an organic, dynamic entity which lives and evolves and bears consequences. It is precisely the unpredictability and long-term nature of these consequences that now pose extremely complex questions which the current rigid, single-discipline-based model of knowledge creation is ill-equipped to approach. This book is therefore an invitation for institutions as well as for us as researchers and teachers to address what it means to produce knowledge today, to ask ourselves how we want our digital society to be and what our shared and collective priorities are, and so to finally produce the change that needs to happen.

As a new principle that goes beyond the constraints of the canonical forms, posthuman critical theory has proposed *transversality*, 'a pragmatic method to render problems multidimensional' (Braidotti and Fuller 2019, 1). With this notion of geometrical transversality, which describes spaces 'in terms of their intersection' (ibid., 9), posthuman critical theory attempts to capture 'relations between relations'. I argue, however, that the suggested image of a transversal cut across entities that were previously disconnected, e.g., disciplines, does not convey the idea of fluid exchanges; rather, it remains confined within ideas of separation and interdisciplinarity, and therefore it only partially breaks with the outdated conceptualisations of knowledge compartmentalisation that it aims to disrupt. The term *transversality*, I maintain, ultimately continues to frame knowledge as solid and essentially separated.

This book firmly opposes notions of division, including the division of knowledge into monolithic disciplines, as they are based on models of reality that support individualism and separateness, which in turn inevitably lead to conflict and competition. To support my argument of an urgent need for knowledge reconfiguration and for new terminologies, I propose to borrow the concept of *symbiosis* from biology. The notion of symbiosis, from the Greek for 'living together', refers in biology to the close and long-term cooperation between different organisms (Sims 2021). Applied to knowledge remodelling and to the digital, *symbiosis* radically breaks with the current conceptualisation of knowledge as a separate, static entity, linear and fragmented into multiple disciplines, and of the digital as an agentless entity. To the contrary, the term *symbiosis* points to the continual renegotiation in the digital of interactions, past, present and future systems, power relations, infrastructures, interventions, curations and curators, programmers and developers (see also Cameron 2021).

Integral to the concept of *symbiosis* is that of *mutualism*; *mutualism* opposes interspecific competition, that is, when organisms from different species compete for a resource, resulting in benefits for only one of the individuals or populations involved (Bronstein 2015). I maintain that the current rigid separation into disciplines resembles an interspecific competition dynamic, as it creates the conditions under which knowledge production has become a space of conflict and competition. As this is not only outdated and inadequate but indeed deeply concerning, I argue that the contemporary notion of knowledge should not simply be redefined but reconceptualised altogether. *Symbiosis* and *mutualism* embed in themselves the principle of knowledge as fluid and inseparable, in which areas of knowledge do not compete against each other but benefit from a mutually compensating relationship. When asking ourselves the questions 'How do we produce knowledge today?' and 'How do we want our next generation of students to be trained?', the concepts of *symbiosis* and *mutualism* may guide the new reconfiguration of our understanding of knowledge in the digital.

*Symbiosis* and *mutualism* are central notions for the development of a more problematised conceptualisation of digital objects and digital knowledge production. Expanding on Cameron's critique of the conceptual attachment to digital cultural heritage as possessing a complete quality of objecthood (Cameron 2021, 14), I maintain that it is not just digital heritage and digital heritage practices that escape notions of completeness and authenticity but in fact *all* digital objects and *all* digital knowledge creation practices. According to this conceptualisation, any intervention on the digital object (e.g., an update, data augmentation interventions, data creation for visualisations) should always be understood as the sum of all the previously made and concurrent decisions, not just by the present curator/analyst, but by external, past actors, too (see for instance, the example of microflming discussed in Sect. 2.2). These decisions in turn shape and are shaped by all the following ones in an endless cycle that continually transforms and creates new object forms, all equally alive, all equally bearing consequences for present and future generations. This is what Cameron calls the 'more-than-human', a convergence of the human and the technical.

I maintain, however, that the 'more-than-human' formulation still presupposes a lack of human agency in the technical (the supposedly non-human) and therefore a yet again binary view of reality. In Cameron's view, the more-than-human arises from the encounter of human agency with the technical, which therefore would not possess agency per se. But agency does not uniquely emerge from the interconnections between, say, the curator (what could be seen as 'the human') and the technical components (i.e., 'the non-human'), because there is no concrete separation between the human and the technical and, in truth, there is no such thing as *neutral technology* (see Sect. 1.2). For example, in the practices of early large-scale digitisation projects, past decisions about what to (not) digitise have eventually led to the current English-centric predominance of data-sets, software libraries, training models and algorithms. Using this technology today contributes to reinforcing Western, white worldviews not just in digital practices, but in society at large.

Hence, if Cameron believes that framing digital heritage as 'possessing a fundamental original, authentic form and function […] is limiting' (ibid., 12), I elaborate further and maintain that it is in fact misleading. Indeed, in constituting and conceptualising digital objects, the question of whether something is or is not authentic *truly* makes no sense; digital objects transcend authenticity; they are *post-authentic*. To conceptualise digital objects as post-authentic means to understand them as unfinished processes that embed a wide net of continually negotiable relations of multiple internal and external actors and past, present and future experiences; it means to look at the human and the technical as symbiotic, non-discriminable elements of the digital's immanent nature, which is therefore understood as situated and consequential. To this end, I introduce a new framework that could inform practices of knowledge reconfiguration: the *post-authentic framework*. The post-authentic framework problematises digital objects by pointing to their aliveness, incompleteness and situatedness, to their entrenched power relations and digital consequences. Throughout the book, I will unpack key theoretical concepts of the post-authentic framework and, through the illustration of four concrete examples of knowledge creation in the digital—creation of digital material, enrichment of digital material, analysis of digital material and visualisation of digital material—I evaluate its full implications for knowledge creation.

# 2.4 CREATION OF DIGITAL OBJECTS

The post-authentic framework acknowledges digital objects as situated, unfinished processes that embed a wide net of continually negotiable relations of multiple actors. It is within the post-authentic framework that I describe the creation of *ChroniclItaly 3.0* (Viola and Fiscarelli 2021a), a digital heritage collection of Italian American newspapers published in the United States by Italian immigrants between 1898 and 1936. I take the formation and curation of this collection as a use case to demonstrate how the post-authentic framework can inform the creation of a digital object in general, reacting to and impacting on institutional and methodological frameworks for knowledge creation. In the case of *ChroniclItaly 3.0*, this includes effects on the very conceptualisation of heritage and heritage practices.

Being the third version of the collection, *ChroniclItaly 3.0* is in itself a demonstration of the continuously and rapidly evolving nature of digital research and of the intrinsic incompleteness of digital objects. I created the first version of the collection, *ChroniclItaly* (Viola 2018), within the framework of the Transatlantic research project *Oceanic Exchanges* (OcEx) (Cordell et al. 2017). OcEx explored how advances in computational periodicals research could help historians trace and examine patterns of information flow across national and linguistic boundaries in digitised nineteenth-century newspaper corpora. Within OcEx, our first priority was therefore to study how news and concepts travelled between Europe and the United States and how, by creating intricate entanglements of informational exchanges, these processes resulted in transnational linguistic and cultural contact phenomena. Specifically, we wanted to investigate how historical newspapers and Transatlantic reporting shaped social and cultural cohesion between Europeans in the United States and in Europe. One focus was specifically on the role of migrant communities as nodes in the Transatlantic transfer of culture and knowledge (Viola and Verheul 2019a). As the main aim was to trace the linguistic and cultural changes that reflected the migratory experience of these communities, we first needed to obtain large quantities of diasporic newspapers that would be representative of the Italian ethnic press at the time. Because of the project's time and cost constraints, such sources needed to be available for computational textual analysis, i.e., already digitised. This is why I decided to machine harvest the digitised Italian American newspapers from *Chronicling America*, the Open Access, Internet-based Library of Congress directory of digitised historical newspapers published in the United States from 1777 to 1963. *Chronicling America* is also an ongoing digitisation project which involves the National Digital Newspaper Program (NDNP), the National Endowment for the Humanities (NEH) and the Library of Congress. Started in 2005, the digitisation programme continuously adds new titles and issues through the funding of digitisation projects awarded to external institutions, mostly universities and libraries, and thus in itself it encapsulates the intrinsic incompleteness of digital infrastructures and digital objects and the far-reaching network of influencing factors and actors involved.
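As an indication of what such machine harvesting can look like, the sketch below queries the public Chronicling America search API (documented at https://chroniclingamerica.loc.gov/about/api/). The query parameters are illustrative only and do not reproduce the actual harvesting procedure used to build *ChroniclItaly*.

```python
# A minimal sketch of harvesting OCR text from the Chronicling America API.
# The search parameters below are hypothetical examples, not the ones used
# for ChroniclItaly; the JSON response carries page metadata and OCR text.
import requests

BASE = "https://chroniclingamerica.loc.gov/search/pages/results/"
params = {
    "andtext": "italiano",  # illustrative full-text query
    "format": "json",
    "page": 1,
}

response = requests.get(BASE, params=params, timeout=30)
response.raise_for_status()

for item in response.json().get("items", []):
    print(item.get("title"), item.get("date"))
    ocr = item.get("ocr_eng") or ""  # OCR text field, where available
    print(ocr[:200])
```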

This wider net of interrelations that influence how digital objects come into being, and which equally influenced the *ChroniclItaly* collections, can be exemplified by the criteria for receiving the *Chronicling America* grant. In line with the NDNP's main aim 'to create a national digital resource of *historically significant* newspapers published between 1690 and 1963, from all the states and U.S. territories' (emphasis mine, NEH 2021, 1), institutions should digitise approximately 100,000 newspaper pages representing their state. How this significance is assessed depends on four principles. First, titles should represent the political, economic and cultural history of the state or territory; second, titles recognised as 'papers of record', that is, containing 'legal notices, news of state and regional governmental affairs, and announcements of community news and events', are preferred (ibid., 2). Third, titles should cover the majority of the population areas, and fourth, titles with longer chronological runs and that have ceased publication are prioritised. Additionally, applicants must commit to assembling an advisory board including scholars, teachers, librarians and archivists to inform the selection of the newspapers to be digitised. The requirement that most heavily conditions which titles are included in *Chronicling America*, however, is the existence of a complete, or largely complete, microfilm 'object of record', with priority given to higher-quality microfilms. In terms of technical requirements, this criterion is adopted for reasons of efficiency and cost; however, as past microfilming practices in the United States were entrenched in a complex web of interrelated factors (*cfr.* Sect. 2.2), the impact of this criterion on the material included in the directory incorporates issues such as previous decisions about what was worth microfilming and, more importantly, what was not.

Furthermore, to ensure consistency across the diverse assortment of institutions involved over the years and throughout the various grant cycles, the programme provides awardees with further technical guidelines. At the same time, however, these guidelines may cause over-representation of larger or mainstream publications; to counterbalance this issue, titles that give voice to under-represented communities are therefore highly encouraged. Although certainly mitigated by multiple review stages (i.e., by each state awardee's advisory board, by the NEH and by peer review experts), the very constitutional structure of *Chronicling America* reveals the far-reaching net of connections, economic and power relations, and multiple actors and factors influencing the decisions about what to digitise. Significantly, it exposes how digitisation processes are intertwined with individual institutions' research agendas and how these may still embed and perpetuate past archival biases.

The creation of *ChroniclItaly* therefore 'inherits' all these decisions and processes of mediation and in turn embeds new ones, such as those stemming from the research aims of the project within which it was created, i.e., OcEx, and the expertise of the curator, i.e., myself. At this stage, for example, we decided not to intervene on the material with any enriching operation, as *ChroniclItaly* mainly served as the basis for a combination of discourse and text analysis investigations that could help us examine the extent to which diasporic communities functioned as nodes and contact zones in the Transatlantic transfer of information.

As we explored the collection further, we realised, however, that limiting our analyses to text-based searches would not exploit the full potential of the archive; we therefore expanded the project with additional grant money earned through Utrecht University's *Innovation Fund for Research in IT*. We made a case for the importance of experimenting with computational methodologies that would allow humanities scholars to identify and map the spatial dimension of digitised historical data as a way to access subjective and situational geographical markers. It is with this aim in mind that I created *ChroniclItaly 2.0* (Viola 2019), the version of the collection annotated with referential entities (i.e., people, places, organisations). As part of this project, we also developed the app *GeoNewsMiner* (GNM) (Viola et al. 2019). This is an interactive graphical user interface (GUI) to visually and interactively explore the references to geographical entities in the collection. Our aim was to allow users to conduct historical, finer-grained analyses such as examining changes in mentions of places over time and across titles as a way to identify the subjective and situational dimension of geographical markers and connect them to explicit geo-references to space (Viola and Verheul 2020a).

The creation of the third version of the collection, *ChroniclItaly 3.0*, should be understood in the context of yet another project, *DeepteXTminER* (DeXTER) (Viola and Fiscarelli 2021b), supported by the Luxembourg Centre for Contemporary and Digital History's (C2DH—University of Luxembourg) *Thinkering Grant*. Its name a blend of the verbs *tinkering* and *thinking*, this grant funds research applying the method of 'thinkering': 'the tinkering with technology combined with the critical reflection on the practice of doing digital history' (Fickers and Heijden 2020). As such, the scheme is specifically aimed at funding innovative projects that experiment with technological and digital tools for the interpretation and presentation of the past. Conceptually, the C2DH itself is an international hub for reflection on the methodological and epistemological consequences of the Digital Turn for history; it serves as a platform for engaging critically with the various stages of historical research (archiving, analysis, interpretation and narrative), with a particular focus on the use of digital methods and tools. Physically, it strives to actualise interdisciplinary knowledge production and dissemination by fostering 'trading zones' (Galison and Stump 1996; Collins et al. 2007), working environments in which interactions and negotiations between different disciplines can happen (Fickers and Heijden 2020). Within this institutional and conceptual framework, I conceived DeXTER as a post-authentic research activity to critically assess and implement different state-of-the-art natural language processing (NLP) and deep learning techniques for the curation and visualisation of digital heritage material. DeXTER's ultimate goal was to bring the utilised techniques into as close an alignment as possible with the principle of human agency (*cfr.* Chap. 3).

The larger ecosystem of the *ChroniclItaly* collections thus exemplifies the evolving nature of digital objects and how international and national processes interweave with wider external factors, all impacting differentially on the objects' evolution. The existence of multiple versions of *ChroniclItaly*, for example, is in itself a reflection of the incompleteness of the *Chronicling America* project, to which titles, issues and digitised material are continually added. *ChroniclItaly* and *ChroniclItaly 2.0* include seven titles and issues from 1898 to 1920 that portray the chronicles of Italian immigrant communities from four states (California, Pennsylvania, Vermont, and West Virginia); *ChroniclItaly 3.0* expands the two previous versions by including three additional titles published in Connecticut and pushing the overall time span to cover 1898 to 1936. In terms of issues, *ChroniclItaly 3.0* almost doubles the number of included pages compared to its predecessors: 8653 vs the 4810 of its previous versions. This is a clear example of how the formation of a digital object is impacted by the surrounding digital infrastructure, which in turn is dependent on funding availability and whose very constitution is shaped by the various research projects and the actors involved in its making.

# 2.5 THE IMPORTANCE OF BEING DIGITAL

Understanding digital objects as post-authentic objects means acknowledging them as part of the complex interaction of countless factors and dynamics and recognising that the majority of such factors and dynamics are invisible and unpredictable. Due to the extreme complexity of the interrelated forces at play, the formidable task of writing both the past in the present and the future past demands careful handling. This is what Braidotti and Fuller call 'a meaningful response move from the relatively short chain of intention-to-consequence […] to the longer chains of consequences in which chance becomes a more structural force' (2019, 13). Here chance is understood as the unpredictable combination of all the numerous known and unknown actors involved, conscious and unconscious biases, past, present and future experiences, and public, private and personal interests. With specific reference to the *ChroniclItaly* collections, for example, in addition to the already discussed multiple factors influencing their creation, many of which date from even decades before, the very nature of this digital object and of its content bears significance for our conceptualisation of digital heritage and, more broadly, for digital knowledge creation practices.

The collections collate immigrant press material. The immigrant press represents the first historical stage of the ethnic press, a phenomenon associated with the mass migration to the Americas between the 1880s and 1920s, when it is estimated that over 24 million people from all around the world arrived in America (Bandiera et al. 2013). Indeed, as immigrant communities were growing exponentially, so did the immigrant press: at the turn of the twentieth century, about 1300 foreign-language newspapers were being printed in the United States with an estimated circulation of 2.6 million (Bjork 1998). By giving immigrants all sorts of practical and social advice—from employment and housing to religious and cultural celebrations and from learning English to acquiring American citizenship—these newspapers truly helped immigrants to transition into American society. As immigrant newspapers quickly became an essential element at many stages of an immigrant's life (Rhodes 2010, 48), the immigrant press is a resource of particular significance, not only for studying the lives of many of the communities that settled in the United States but also for opening a comprehensive window onto the American society of the time (Viola and Verheul 2020a).

As far as the Italians were concerned, it has been calculated that by 1920 they represented more than 10% of the non-US-born population (about 4 million) (Wills 2005). The Italian community was also among the most prolific newspaper producers; between 1900 and 1920, there were 98 Italian titles that managed to publish uninterruptedly, whereas at its publication peak, this number ranged between 150 and 264 (Deschamps 2007, 81). In terms of circulation, in 1900, 691,353 Italian newspapers were sold across the United States (Park 1922, 304), while in New York alone, the circulation ratio of the Italian daily press is calculated as one paper for every 3.3 Italian New Yorkers (Vellon 2017, 10). Distribution and circulation figures should, however, be doubled or perhaps even tripled, as illiteracy levels were still high among this generation of Italians and newspapers were often read aloud (Park 1922; Vellon 2017; Viola and Verheul 2019a; Viola 2021).

These impressive figures on the whole may point to the influential role of the Italian-language press not just for the immigrant community but within the wider American context, too. At a time when the mass migrations were causing a redefinition of social and racial categories, notions of race, civilisation, superiority and skin colour had polarised into the binary opposition of white/superior vs non-white/inferior (Jacobson 1998; Vellon 2017; Viola and Verheul 2019a). The whiteness category, however, was rather complex and not at all based exclusively on skin colour. Jacobson (1998), for instance, describes it as 'a system of "difference" by which one might be both white and racially distinct from other whites' (ibid., 6). Indeed, during the period covered by the *ChroniclItaly* collections, immigrants were granted 'white' privileges depending not on how white their skin might have been but rather on how white they were perceived (Foley 1997). Immigrants in the United States who were experiencing this uncertain social identity have been described as 'conditionally white' (Brodkin 1998), 'situationally white' (Roediger 2005) and 'inbetweeners' (among others Barrett and Roediger 1997; Guglielmo and Salerno 2003; Guglielmo 2004; Orsi 2010).

This was precisely the complicated identity and social status of Italians, especially of those coming from Southern Italy; because of their challenging economic and social conditions and their darker skin, both other ethnic groups and Americans considered them socially and racially inferior and often discriminated against them (LaGumina 1999; Luconi 2003). For example, Italian immigrants would often be excluded from employment and housing opportunities and be victims of social discrimination, exploitation, physical violence and even lynching (LaGumina 1999; Connell and Gardaphé 2010; Vellon 2010; LaGumina 2018; Connell and Pugliese 2018). The social and historical importance of Italian immigrant newspapers lies in how they advocated for the rights of the communities they represented, crucially acting as powerful forces of inclusion, community building and national identity preservation, as well as tools of language and cultural retention. At the same time, because this advocacy role was often paired with the condemnation of American discriminatory practices, these newspapers also played a decisive role in transforming American society at large, undoubtedly contributing to the tangible shaping of the country. The immigrant press and the *ChroniclItaly* collections can therefore be an extremely valuable source for investigating specifically how the internal mechanisms of cohesion, class struggle and identity construction of the Italian immigrant community contributed to transform America.

Lastly, these collections can also bring insights into the Italian immigrants' role in the geographical shaping of the United States. The majority of the 4 million Italians who had arrived in the United States—mostly uneducated and mostly from the south—had done so as the result of chain migration. Naturally, they would settle close to relatives and friends, creating self-contained neighbourhoods clustered according to different regional and local affiliations (MacDonald and MacDonald 1964). Through the study of the geographical places contained in the collections as well as the places of publication of the newspapers' titles, the *ChroniclItaly* collections provide an unconventional and traditionally neglected source for studying the transforming role of migrants for host societies.

On the whole, however, the novel contribution of the *ChroniclItaly* collections comes from the fact that they allow us to devote attention to the study of historical migration as a process experienced by the migrants themselves (Viola 2021). This is rare, as in discourse-based migration research the analysis tends to focus on discourse *on* migrants, rather than *by* migrants (De Fina and Tseng 2017; Viola 2021). Instead, through the analysis of migrants' narratives, it is possible to explore how displaced individuals dealt with social processes of migration and transformation and how these affected their inner notions of identity and belonging. A large-scale digital discourse-based study of migrants' narratives creates a mosaic of migration, a collective memory constituted by individual stories. In this sense, the importance of being digital lies in the fact that this information can be processed on a large scale and across different migrant communities. The digital therefore also offers the possibility, perhaps unimaginable before, of a kaleidoscopic view that simultaneously apprehends historical migration discourse as a combination of inner and outer voices across time and space. Furthermore, as records are regularly updated, observations can be continually enriched, adjusted, expanded, recalibrated, generalised or contested. At the same time, mapping these narratives creates a shimmering network of relations between the past migratory experiences of diasporic communities and the contemporary migration processes experienced by ethnic groups, which can then be compared and analysed from the perspective of both active participants and spectators.

Abby Smith Rumsey said that the true value of the past is that it is the raw material we use to create the future (Rumsey 2016). It is only through gaining awareness of these spatio-temporal correspondences that the past can become part of our collective memory and, by preventing us from forgetting it, of our collective future. Understanding digital objects through the post-authentic lens entails that great emphasis must be placed on the processes that generate the mappings of these correspondences. The post-authentic framework recognises that these processes cannot be neutral, as they stem from systems of interpretation and management which are situated and therefore partial. These processes are never complete, nor can they be completed, and as such they require constant updating and critical supervision.

In the next chapter, I will illustrate the second use case of this book, data augmentation; the case study demonstrates that the task of enriching a digital object is a complex managerial activity, made up of countless critical decisions, interactions and interventions, each one having consequences. The application of the post-authentic framework to enriching *ChroniclItaly 3.0* demonstrates how symbiosis and mutualism can guide the interaction with the digital as it unfolds in the process of knowledge creation. I will specifically focus on why computational techniques such as optical character recognition (OCR), named entity recognition (NER), geolocation and sentiment analysis (SA) are problematic, and I will show how the post-authentic framework can help address the ambiguities and uncertainties of these methods when building a source of knowledge for current and future generations.


# The Opposite of Unsupervised

When you control someone's understanding of the past, you control their sense of who they are and also their sense of what they can imagine becoming. (Abby Smith Rumsey, 2016)

# 3.1 ENRICHMENT OF DIGITAL OBJECTS

After the initial headlong rush to digitisation, libraries, museums and other cultural heritage institutions realised that simply making sources digitally available did not ensure their use; what in fact became apparent was that as the body of digital material grew, users' engagement decreased. This was rather disappointing but, more importantly, it was worrisome. Millions had been poured into large-scale digitisation projects, pitched to funding agencies as the ultimate Holy Grail of cultural heritage (*cfr.* Chap. 2), a safe, more efficient way to protect and preserve humanity's artefacts and develop new forms of knowledge simply unimaginable in the pre-digital era. Although some of this was true, what had not been anticipated was the increasing difficulty experienced by users in retrieving meaningful content, a difficulty that grew with the rate of digital expansion. Especially when collections were paired with poor interface design, users were left with an overall unpleasant experience, feeling frustrated, overwhelmed and dissatisfied.

Thus, to recoup the investment in digitisation, institutions urgently needed novel approaches to maximise the potential of their digital collections. It soon became obvious that the solution was to simplify and improve the process of exploring digital archives, to make information retrievable in more valuable ways and the user experience more meaningful on the whole. Within the wider incorporation of technology in all sectors, it is not at all surprising that ML and AI have been more than welcomed to the digital cultural heritage table. Indeed, AI is particularly appreciated for its capacity to automate lengthy and tedious processes that nevertheless enhance exploration and retrieval for conducting more in-depth analyses, such as the task of annotating large quantities of digital textual material with referential information. As this technology continues to develop together with new tools and methods, it is increasingly used to help institutions fulfil the main purposes of heritagisation: knowledge preservation and access.

One widespread way to enhance access is through 'content enrichment', or enrichment for short. It consists of a wide range of techniques implemented to achieve several goals, from improving the accuracy of metadata for better content classification<sup>1</sup> to annotating textual content with contextual information, the latter typically used for tasks such as discovering layers of information obscured by data abundance (see, for instance, Taylor et al. 2018; Viola and Verheul 2020a). There are at least four main types of text annotation: entity annotation (e.g., named entity recognition—NER), entity linking (e.g., entity disambiguation), text classification and linguistic annotation (e.g., part-of-speech tagging—POS). Content enrichment is also often used by digital heritage providers to link collections together or to populate ontologies that aim to standardise procedures for digital source preservation and to help retrieval and exchange (among others Albers et al. 2020; Fiorucci et al. 2020).

The theoretical relevance of performing content enrichment, especially for digital heritage collections, lies precisely in its great potential for discovering the cultural significance underneath referential units, for example, by cross-referencing them with other types of data (e.g., historical, social, temporal). We enriched *ChroniclItaly 3.0* for NER, geocoding and sentiment within the context of the DeXTER project. Informed by the post-authentic framework, DeXTER combines the creation of an enrichment workflow with a meta-reflection on the workflow itself. Through this symbiotic approach, our intention was to prompt a fundamental rethink of both the way digital objects and digital knowledge creation are understood and the practices of digital heritage curation in particular.

It is all too often assumed that enrichment, or at least parts of it, can be fully automated, unsupervised and even launched as a one-step pipeline. Preparing the material to be ready for computational analysis, for example, often ambiguously referred to as 'cleaning', is typically presented as something not worthy of particular critical scrutiny. We are misleadingly told that operations such as tokenisation, lowercasing, stemming, lemmatisation and removing stopwords, numbers, punctuation marks or special characters need not be problematised as they are rather tedious, 'standard' operations. My intention here is to show that it is on the contrary paramount that any intervention on the material is tackled critically. When preparing the material for further processing, full awareness of the curator's influential role is required, as each action taken triggers a different chain reaction and will therefore output a different version of the material. To implement one operation over another influences how the algorithms will process such material and ultimately, how the collection will be enriched, the information accessed, retrieved and finally interpreted and passed on to future generations (Viola and Fiscarelli 2021b, 54).

Broadly, the argument I present provokes a discussion and critique of the fetishisation of empiricism and technical objectivity not just in humanities research but in knowledge creation more widely. It is this critical and humble awareness that reduces the risks of over-trusting the pseudo-neutrality of processes, infrastructures, software, categories, databases, models and algorithms. The creation and enrichment of *ChroniclItaly 3.0* show how the conjuncture of the implicated structural forces and factors cannot be envisioned as a network of linear relations and, as such, cannot be predicted. The acknowledgement of the limitations and biases of specific tools and choices adopted in the curation of *ChroniclItaly 3.0* takes the form of a thorough documentation of the steps and actions undertaken during the process of creation of the digital object. In this way, it is not just the *product*, however incomplete, that is seen as worthy of preservation for current and future generations, but equally the *process* (or indeed processes) for creating it. Products and processes are unfixed and subject to change; they transcend questions of authenticity; they allow room for multiple versions, all equally post-authentic, in that they may reflect different curators and materials, different programmers, rapid technological advances, changing temporal frameworks and values.

# 3.2 PREPARING THE MATERIAL

Which of the preparatory operations for enrichment one should perform, and how to assess them critically, depends on internal factors such as the language of the collection, the type of material and the specific enrichment tasks to follow, as well as external factors such as the available means and resources, both technical and financial; the time-frame, the intended users and research aims; the infrastructure that will store the enriched collection; and so forth. Indeed, far from being 'standard', each intervention needs to be specifically tailored to individual cases. Moreover, since each operation is factually an additional layer of manipulation, it is fundamental that scholars, heritage operators and institutions assess carefully to what degree they want to intervene on the material and how, and that their decisions are duly documented and motivated. In the case of *ChroniclItaly 3.0*, for example, the documentation of the specific preparatory interventions taken towards enriching the collection, namely, tokenisation, removing numbers and dates and removing words with fewer than two characters as well as special characters, is embedded as an integral part of the actual workflow (a minimal sketch of these operations follows below). I wanted to signal the need for refiguring digital knowledge creation practices as honest and fluid exchanges between computational and human agency, counterbalancing the narrative that depicts computational techniques as autonomous processes from which the human is (should be?) removed. Thus, treating this as a post-authentic project, I have considered each action as part of a complex web of interactions between the multiple factors and dynamics at play, with the awareness that the majority of such factors and dynamics are invisible and unpredictable. Significantly, the documentation of the steps, tools and decisions serves the valuable function of acknowledging such awareness for contemporary and future generations.
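
The sketch below illustrates what such preparatory interventions might look like in code. It is not the DeXTER implementation: the function name, the date pattern and the exact set of characters treated as 'special' are assumptions made purely for illustration.

```python
import re

# A simple date pattern such as 12/03/1915 (an assumed format, for illustration).
DATE = re.compile(r"\d{1,2}[/.\-]\d{1,2}[/.\-]\d{2,4}")

def prepare(raw_text):
    """Tokenise, then remove numbers, dates, one-character words and 'special'
    characters; stopwords and sentence punctuation are deliberately kept
    (see the discussion that follows)."""
    kept = []
    for token in raw_text.split():  # simple whitespace tokenisation
        if DATE.fullmatch(token):   # remove dates
            continue
        # Strip 'special' characters: anything that is not a word character,
        # an apostrophe or the punctuation later needed to delimit sentences.
        token = re.sub(r"[^\w'.,;:!?]", "", token)
        core = re.sub(r"[\W_]", "", token)  # letters/digits only, for the tests below
        if not core or core.isdigit():      # remove leftovers and bare numbers
            continue
        if len(core) < 2:                   # remove words with fewer than two characters
            continue
        kept.append(token)
    return kept
```

Even a sketch this small forces the judgement calls discussed above: what counts as a 'special' character, whether apostrophes belong to words, which date formats to anticipate.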

This process can be envisioned as a continuous dialogue between human and artificial intelligence, and it can be illustrated by describing how we handled stopwords (e.g., prepositions, articles, conjunctions) and punctuation marks when preparing *ChroniclItaly 3.0* for enrichment. Typically, stopwords are reputed to be semantically non-salient and even potentially disruptive to the algorithms' performance; as such, they are normally removed automatically. However, as they are of course language-bound, removing these items indiscriminately can hinder future analyses, with more destructive consequences than keeping them. Thus, when enriching *ChroniclItaly 3.0*, we considered two fundamental factors: the language of the data-set—Italian—and the enrichment actions to follow, namely, NER, geocoding and SA. For example, we considered that in Italian, prepositions are often part of locations (e.g., *America del Nord*—North America), organisations (e.g., *Camera del Senato*—the Senate) and people's names (e.g., Gabriele **d**'Annunzio); removing them could have negatively interfered with how the NER model had been trained to recognise referential entities. Similarly, in preparation for performing SA at sentence level (*cfr.* Sect. 3.4), we did not remove punctuation marks; in Italian, punctuation marks are typical sentence delimiters and therefore indispensable for the identification of sentence boundaries.

Another operation that we critically assessed concerns the decision whether to lowercase the material before performing NER and geocoding. Lowercasing text before performing other actions can be a double-edged sword. For example, if lowercasing is not implemented, a NER algorithm will likely process tokens such as 'USA', 'Usa', 'usa', 'UsA' and 'uSA' as distinct items, even though they may all refer to the same entity. This may turn out to be problematic as it could provide a distorted representation of that particular entity and how it is connected to other elements in the collection. On the other hand, if the material is lowercased, it may become difficult for the algorithm to identify 'usa' as an entity at all,<sup>2</sup> which may result in a high number of false negatives, thus equally skewing the output. We, once again, intervened as human agents: we considered that entities such as persons, locations and organisations are typically capitalised in Italian and therefore, in preparation for NER and geocoding, lowercasing was not performed. However, once these steps were completed, we did lowercase the entities and, following a manual check, we merged multiple items referring to the same entity. This method allowed us to obtain a more realistic count of the number of entities identified by the algorithm and resulted in a significant redistribution of the entities across the different titles, as I will discuss in Sect. 3.3. Albeit more accurate, this approach did not come without problems and repercussions; many false negatives are still present and therefore the tagged entities are not all the entities in the collection. I will return to this point in Chap. 5.
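
The order of operations matters here: tagging first, lowercasing and merging afterwards. A minimal sketch of the merging step, with invented entity mentions, might look as follows:

```python
from collections import Counter

# Entity mentions as (surface form, tag) pairs, as they might come out of the
# NER step; the examples and counts are invented for illustration.
mentions = [("USA", "GPE"), ("Usa", "GPE"), ("usa", "GPE"), ("Roma", "GPE")]

merged = Counter()
for surface, tag in mentions:
    # Lowercasing only *after* tagging, so that capitalisation can still guide
    # the model on the original text; a manual check then confirms that the
    # merged variants really do refer to the same entity.
    merged[(surface.lower(), tag)] += 1

print(merged)  # Counter({('usa', 'GPE'): 3, ('roma', 'GPE'): 1})
```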

The decision we took to remove numbers, dates and special characters is also a good example of the importance of being deeply engaged with the specificity of the source and of how that specificity changes the application of the technology through which that engagement occurs. Like the large majority of the newspapers collected in *Chronicling America*, the pages forming *ChroniclItaly 3.0* were digitised primarily from microfilm holdings; the collection therefore presents the same issues common to OCR-generated searchable texts (as opposed to born-digital texts), such as errors derived from the low readability of unusual fonts or very small characters. However, in the case of *ChroniclItaly 3.0*, additional factors must be considered when dealing with OCR errors. The newspapers aggregated in the collection were likely digitised by different NDNP awardees, who probably employed different OCR engines and/or chose different OCR settings, thus ultimately producing different errors which in turn affected the collection's accessibility in an unsystematic way. Like all ML prediction models, OCR engines embed various biases encoded not only in the engine's architecture but, more importantly, in the data-sets used for training the model (Lee 2020). These data-sets typically consist of sets of transcribed typewritten pages which embed human subjectivity (e.g., spelling errors) as well as individual decisions (e.g., spelling variations).

All these factors have wider, unpredictable consequences. As previously discussed in reference to microfilming (*cfr.* Sect. 2.2), OCR technology has raised concerns regarding marginalisation, particularly with reference to the technology's consequences for content discoverability (Noble 2018; Reidsma 2019). These scholars have argued that this issue is closely related to the fact that the most largely implemented OCR engines are both licensed and opaquely documented; they therefore not only reflect the strategic, commercial choices made by their creators according to specific corporate logics but they are also practically impossible to audit. Despite being promoted as 'objective' and 'neutral', these systems incorporate prejudices and biases, strong commercial interests, third-party contracts and layers of bureaucratic administration. Nevertheless, this technology is implemented on a large scale and it therefore deeply impacts what—on a large scale—is found and lost, what is considered relevant and irrelevant, what is preserved and passed on to future generations and what will not be, what is researched and studied and what will not be accessed.

Understanding digital objects as post-authentic entails being mindful of all the alterations and transformations occurring prior to accessing the digital record and how each one of them is connected to wider networks of systems, factors and complexities, most of which are invisible and unpredictable. Similarly, any following intervention adds further layers of manipulation and transformation which incorporate the previous ones and which will in turn have future, unpredictable consequences. For example, in Sect. 2.2 I discussed how previous decisions about what was worth digitising dictated which languages needed to be prioritised, in turn determining which training data-sets were compiled for different language models, leading to the current strong bias towards English models, data-sets and tools and an overall digital language and cultural injustice.

Although the non-English content in *Chronicling America* has been reviewed by language experts, many additional OCR errors may have originated from markings on the material pages or a generally poor condition of the physical object. Again, the specificity of the source adds further complexity to the many problematic factors involved in its digitisation; in the case of *ChroniclItaly 3.0*, for example, we found that OCR errors were often rendered as numbers and special characters. To alleviate this issue, we decided to remove such items from the collection. This step impacted the material differently, not just across titles but even across issues of the same title. Figure 3.1 shows, for example, the impact of this operation on *Cronaca Sovversiva*, one of the newspapers collected in *ChroniclItaly 3.0* with the longest publication record, spanning almost the entire archive period, 1903–1919. On the whole, this intervention reduced the total number of tokens from 30,752,942 to 21,454,455, equal to about 30% of the overall material removed (Fig. 3.2). Although with sometimes substantial variation, we found the overall OCR quality to be generally better in the

**Fig. 3.1** Variation of removed material (in percentage) across issues/years of *Cronaca Sovversiva*

**Fig. 3.2** Impact of pre-processing operations on *ChroniclItaly 3.0* per title. Figure taken from Viola and Fiscarelli (2021b)

most recent texts. This characteristic is shared by most OCRed nineteenth-century newspapers, and it has been ascribed to a better conservation status or better initial condition of the originals, which overall improved over time (Beals and Bell 2020). Figure 3.3 shows the variation of removed material in *L'Italia*, the largest newspaper in the collection, comprising 6489 issues published uninterruptedly from 1897 to 1919.

Finally, my experience of previously working on the *GeoNewsMiner* (GNM) project (Viola et al. 2019) also influenced the decisions we took when enriching *ChroniclItaly 3.0*. As noted in Sect. 2.4, GNM loads *ChroniclItaly 2.0*, the version of the *ChroniclItaly* collections annotated with referential entities without any of the pre-processing tasks described here in reference to *ChroniclItaly 3.0* having been performed. A post-tagging manual check revealed that, even though the F1 score of the NER model—that is, the measure used to test a model's accuracy—was 82.88, due to OCR errors the locations occurring fewer than eight times were in fact false positives (Viola et al. 2019; Viola and Verheul 2020a). Hence, the interventions we made on *ChroniclItaly 3.0* aimed at reducing the OCR errors to increase the discoverability of elements that were not identified in the GNM project. When researchers are not involved in the creation

**Fig. 3.3** Variation of removed material (in percentage) across issues/years of *L'Italia*

of the applied algorithms or in choosing the data-sets for training them (which, especially in the humanities, represents the majority of cases) and consequently when tools, models and methods are simply reused as part of the available resources, the post-authentic framework can provide a critical methodological approach to address the many challenges involved in the process of digital knowledge creation.

The illustrated examples demonstrate the complex interactions between the materiality of the source and the digital object, between the enrichment operations and the concurrent curator's context, and even among the enrichment operations themselves. The post-authentic framework highlights the artificiality of any notion conceptualising digital objects as *copies*, unproblematised and disconnected from the material object. Indeed, understanding digital objects as post-authentic means acknowledging the continuous flow of interactions between the multiple factors at play, only some of which I have discussed here. Particularly in the context of digital cultural heritage, it means acknowledging the curators' awareness that the past is written in the present, and so it functions as a warning against ignoring the collective memory dimension of what is created, that is, the importance of being digital.

# 3.3 NER AND GEOLOCATION

In addition to the typical motivations for annotating a collection with referential entities such as sorting unstructured data and retrieving potentially important information, my decision to annotate *ChroniclItaly 3.0* using NER, geocoding and SA was also closely related to the nature of the collection itself, i.e., the specificity of the source. One of the richest values of engaging with records of migrants' narratives is the possibility to study how questions of cultural identities and nationhood are connected with different aspects of social cohesion in transnational, multicultural and multilingual contexts, particularly as a social consequence of migration. Produced by the migrants themselves and published in their native language, ethnic newspapers such as those collected in *ChroniclItaly 3.0* function in a complex context of displacement, and as such, they offer deep, subjective insights into the experience and agency of human migration (Harris 1976; Wilding 2007; Bakewell and Binaisa 2016; Boccagni and Schrooten 2018).

Ethnic newspapers, for instance, provide extensive material for investigating the socio-cognitive dimension of migration through markers of identity. Markers of identity can be cultural, social or biological such as artefacts, family or clan names, marriage traditions and food practices, to name but a few (Story and Walker 2016). Through shared claims of ethnic identity, these markers are essential to communities for maintaining internal cohesion and negotiating social inclusion (Viola and Verheul 2019a). But in diasporic contexts, markers of identity can also reveal the changing subtle renegotiations of migrants' cultural affiliation in mediating interests of the homeland with the host environment. Especially when connected with entities such as places, people and organisations, these markers can be part of collective narratives of pride, nostalgia or loss, and their analysis may therefore bring insights into how cultural markers of identity and ethnicity are formed and negotiated and how displaced individuals make sense of their migratory experience. The ever-larger amount of available digital sources, however, has created a complexity that cannot easily be navigated, certainly not through close reading methods alone. Computational methods such as NER methodologies, though presenting limitations and challenges, can help identify names of people, places, brands and organisations, thus providing a way to identify markers of identity on a large scale.

We annotated *ChroniclItaly 3.0* by using a NER deep learning sequence tagging tool (Riedl and Padó 2018) which identified 547,667 entities occurring 1,296,318 times across the ten titles.<sup>3</sup> A close analysis of the output, however, revealed a number of issues which required a critical intervention combining expert knowledge and technical ability. In some cases, for example, entities had been assigned the wrong tag (e.g., 'New York' tagged as a person), other times elements referring to the same entity had been tagged as different entities (e.g., 'Woodrow Wilson', 'President Woodrow Wilson'), and in some other cases elements identified as entities were not entities at all (e.g., *venerdì* 'Friday' tagged as an organisation). To avoid the risk of introducing new errors, we intervened on the collection manually; we performed this task by first conducting a thorough historical triangulation of the entities and then by compiling a list of the most frequent historical entities that had been attributed the wrong tag. Although it was not possible to 'repair' all the tags, this post-tagging intervention affected the redistribution of 25,713 entities across all the categories and titles, significantly improving the accuracy of the tags that would serve as the basis for the subsequent enrichment operations (i.e., geocoding and SA). Figure 3.4 shows how in some cases the redistribution caused a substantial variation: for example, the number of entities in the LOC (location) category significantly decreased in *La Rassegna* but increased in *L'Italia*. The documentation of these processes of transformation is available Open Access<sup>4</sup> and acts as a way to acknowledge them as problematic, as undergoing several layers of manipulation and interventions, including the multidirectional relationships between the specificity of the source, the digitised material and all the surrounding factors at play. Ultimately, the post-authentic framework to digital objects frames digital knowledge creation as honest and accountable, unfinished and receptive to alternatives.
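
A post-tagging repair of this kind can be expressed as a curated correction table applied to the tagger's output. The sketch below is illustrative only; the entries mirror the three error types just described, while the real correction lists are in the project documentation.

```python
# Corrections compiled after historical triangulation; the entries are invented
# examples of the three error types described above. None drops a spurious entity.
CORRECTIONS = {
    ("New York", "PER"): ("New York", "LOC"),                        # wrong tag
    ("President Woodrow Wilson", "PER"): ("Woodrow Wilson", "PER"),  # same entity, two forms
    ("venerdì", "ORG"): None,                                        # not an entity at all
}

def repair(tagged):
    """Apply the manual corrections to a list of (surface form, tag) pairs."""
    fixed = []
    for surface, tag in tagged:
        correction = CORRECTIONS.get((surface, tag), (surface, tag))
        if correction is not None:
            fixed.append(correction)
    return fixed

print(repair([("New York", "PER"), ("venerdì", "ORG")]))  # [('New York', 'LOC')]
```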

Once entities in *ChroniclItaly 3.0* were identified, annotated and verified, we decided to geocode places and locations and to subsequently visualise their distribution on a map. Especially in the case of large collections with hundreds of thousands of such entities, their visualisation may greatly facilitate the discovery of deeper layers of meaning that may otherwise be largely or totally obscured by the abundance of material available. I will discuss the challenges of visualising digital objects in Chap. 5 and illustrate how the post-authentic framework can guide both

**Fig. 3.4** Distribution of entities per title after intervention. Positive bars indicate a decreased number of entities after the process, whilst negative bars indicate an increased number. Figure taken from Viola and Fiscarelli (2021b)

the development of a UI and the encoding of criticism into graphical display approaches.

Performing geocoding as an enrichment intervention is another example of how the process of digital knowledge creation is inextricably entangled with external dynamics and processes, dominant power structures and past and current systems in an intricate net of complexities. In the case of *ChroniclItaly 3.0*, for instance, the process of enriching the collection with geocoding information shares many of the same challenges as any material whose language is not English. Indeed, the relative scarcity of certain computational resources available for languages other than English, as already discussed, often dictates which tasks can be performed, with which tools and through which platforms. Practitioners and scholars as well as curators of digital sources often have to choose between either creating resources *ad hoc*, e.g., developing new algorithms, fine-tuning existing ones, training their own models according to their specific needs, or more simply using the resources available to them. Either option may not be ideal or even possible at all, however. For example, due to time or resource limitations or to a lack of specific expertise, the first approach may not be economically or technically feasible. On the other hand, even when models and tools in the language of the collection do exist—as in the case of *ChroniclItaly 3.0*—typically their creation would have occurred within the context of another project and for other purposes, possibly using training data-sets with very different characteristics from the material one is enriching. This often means that the curator of the enrichment process must inevitably make compromises with the methodological ideal. For example, in the case of *ChroniclItaly 3.0*, in the interest of time, we annotated the collection using an already existing Italian NER model. The manual annotation of parts of the collection to train an *ad hoc* model would have certainly yielded much more accurate results, but it would have been a costly, lengthy and labour-intensive operation. On the other hand, while being able to use an already existing model was certainly helpful and provided an acceptable F1 score, it also resulted in a poor individual performance for the detection of the entity LOC (locations) (54.19%) (Viola and Fiscarelli 2021a). This may have been due to several factors, such as a lack of LOC-category entities in the data-set used for originally training the NER model or a difference between the types of LOC entities in the training data-set and the ones in *ChroniclItaly 3.0*. Regardless of the reason, due to the low score, we decided not to geocode (and therefore visualise) the entities tagged as LOC; they can however still be explored, for example, as part of SA or in the GitHub documentation available Open Access. Though not optimal, this decision was motivated also by the fact that geopolitical entities (GPE) are generally more informative than LOC entities, as they typically refer to countries and cities (though sometimes the algorithm also retrieved counties and states), whereas LOC entities are typically rivers, lakes and geographical areas (e.g., the Pacific Ocean). However, users should be aware that the entities currently geocoded are by no means all the places and locations mentioned in the collection; future work may also focus on performing NER using a more fine-tuned algorithm so that the LOC-type entities could also be geocoded.
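
In code, the geocoding step for the retained GPE entities might look like the sketch below. It uses geopy with the Nominatim service purely as an assumed stand-in; the geocoder actually used in DeXTER, together with the entity lists, is documented in the project repository.

```python
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

# Nominatim and the user_agent string are illustrative choices, not DeXTER's.
geolocator = Nominatim(user_agent="chroniclitaly-example")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)  # throttle requests

# Only GPE-tagged entities are geocoded; LOC entities are skipped because of
# the model's poor performance (54.19%) on that category.
gpe_entities = ["New York", "Chicago", "Roma"]  # invented sample
coordinates = {}
for name in gpe_entities:
    result = geocode(name)
    if result is not None:  # unresolved names are left out rather than guessed
        coordinates[name] = (result.latitude, result.longitude)
print(coordinates)
```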

# 3.4 SENTIMENT ANALYSIS

Annotating textual material for attitudes, either sentiment or opinions, through a method called sentiment analysis (SA) is another enriching technique that can add value to digital material. This method aims to identify the prevailing emotional attitude in a given text, though it often remains unclear whether the method detects the attitude of the writer or the expressed polarity in the analysed textual fragment (Puschmann and Powell 2018). Within DeXTER, we used SA to identify the prevailing emotional attitude towards referential entities in *ChroniclItaly 3.0*. Our intention was twofold: firstly, to obtain a more targeted enrichment experience than would have been possible by applying SA to the entire collection and, secondly, to study referential entities as markers of identity so as to access the layers of meaning migrants attached historically to people, organisations and geographical spaces. Through the analysis of the meaning humans invested in such entities, our goal was to delve into how their collective emotional narratives may have changed over time (Tally 2011; Donaldson et al. 2017; Taylor et al. 2018; Viola and Verheul 2020a). Because of the specific nature of *ChroniclItaly 3.0*, this exploration inevitably intersects with understanding how questions of cultural identities and nationhood were connected with different aspects of social cohesion (e.g., transnationalism, multiculturalism, multilingualism), how processes of social inclusion unfolded in the context of the Italian American diaspora, how Italian migrants managed competing feelings of belonging and how these may have changed over time.

SA is undoubtedly a powerful tool that can facilitate the retrieval of valuable information when exploring large quantities of textual material. Understanding SA within the post-authentic framework, however, means recognising that specific assumptions about what constitutes valuable information, what is understood by sentiment and how it is understood and assessed guided the design of the technique. All these assumptions are invisible to the user; the post-authentic framework warns the analyst to be wary of the indiscriminate use of the technique. Indeed, like other techniques used to augment digital objects including digital heritage material, SA did not originate within the humanities; SA is a computational linguistics method developed within natural language processing (NLP) studies as a subfield of information retrieval (IR). In the context of visualisation methods, Johanna Drucker has long discussed the dangers of a blind and unproblematised application of approaches brought into the humanities from other disciplines, including computer science. Particularly about the specific assumptions at the foundation of these techniques, she points out, 'These assumptions are cloaked in a rhetoric taken wholesale from the techniques of the empirical sciences that conceals their epistemological biases under a guise of familiarity' (Drucker 2011, 1). In Chap. 4, I will discuss the implications of a very closely related issue, the metaphorical use of everyday lexicon such as '*sentiment* analysis', '*topic* modelling' and 'machine *learning*' as a way to create familiar images whilst referring to rather different concepts from what is generally internalised in the collective image. In the case of SA, for example, the use of the familiar word 'sentiment' conceals the fact that this technique was specifically designed to infer general opinions from product reviews and that, accordingly, it was not conceived for empirical social research but first and foremost as an economic instrument.

The application of SA in domains different from its original conception poses several challenges which are well known to computational linguists (the technique's creators) but perhaps less known to others; whilst opinions about products and services are not typically problematic, as this is precisely the task for which SA was developed, opinions about social and political issues are much harder to tackle due to their much higher linguistic and cultural complexity. This is because SA algorithms lack sufficient background knowledge of the local social and political contexts, not to mention the challenges of detecting and interpreting sarcasm, puns, plays on words and ironies (Liu 2020). Thus, although most SA techniques will score opinions about products and services fairly accurately, they will likely perform poorly on opinionated social and political texts. This limitation therefore makes the use of SA problematic when other disciplines such as the humanities and the social sciences borrow it uncritically; worse yet, it raises disturbing questions when the technique is embedded in a range of algorithmic decision-making systems based, for instance, on content mined from social media. For example, since its explosion in the early 2000s, SA has been heavily used in domains of society that transcend the method's original conception: it is constantly applied to make stock market predictions, in the health sector and by government agencies to analyse citizens' attitudes or concerns (Liu 2020).

In this already overcrowded landscape of interdependent factors, there is another element that adds yet more complexity to the matter. As with other computational techniques, the discourse around SA depicts the method as detached from any subjectivity, as a technique that provides a neutral and observable description of reality. In their analysis of the cultural perception of SA in research and the news media, Puschmann and Powell (2018) highlight, for example, how the public perception of SA is misaligned with its original function and how such misalignment 'may create epistemological expectations that the method cannot fulfill due to its technical properties and narrow (and well-defined) original application to product reviews' (2). Indeed, we are told that SA is a quantitative method that provides us with a picture of opinionated trends in large amounts of material otherwise impossible to map. In reality, the reduction of something as idiosyncratic as the definition of human emotions to two/three categories is highly problematic as it hides the whole set of assumptions behind the very establishment of such categories. For example, it remains unclear what is meant by neutral, positive or negative, as these labels are typically presented as a given, as if they were unambiguous categories universally accepted (Puschmann and Powell 2018). On the contrary, to put it in Drucker's words, 'the basic categories of supposedly quantitative information […] are already interpreted expressions' (Drucker 2011, 4).

Through the lens of the post-authentic framework, the application of SA is acknowledged as problematic and so is the intrinsic nature of the technique itself. A SA task is usually modelled as a classification problem, that is, a classifier processes pre-defined elements in a text (e.g., sentences), and it returns a category (e.g., positive, negative or neutral). Although there are so-called fine-grained classifiers which attempt to provide a more nuanced distinction of the identified sentiment (e.g., very positive, positive, neutral, negative, very negative) and some others even return a prediction of the specific corresponding sentiment (e.g., anger, happiness, sadness), in the post-authentic framework, it is recognised that it is the fundamental notion of sentiment as discrete, stable, fixed and objective that is highly problematic. In Chap. 4, I will return to this concept of discrete modelling of information with specific reference to ambiguous material, such as cultural heritage texts; for now, I will discuss the issues concerning the discretisation of linguistic categories, a well-known linguistic problem.

In his classic book *Foundations of Cognitive Grammar*, Ronald Langacker (1983) famously pointed out how it is simply not possible to unequivocally define linguistic categories; this is because language does not exist in a vacuum and all human exchanges are always context-bound, viewpointed and processual (see Langacker 1983; Talmy 2000; Croft and Cruse 2004; Dancygier and Sweetser 2012; Gärdenfors 2014; Paradis 2015). In fields such as corpus linguistics, for example, which heavily rely on manually annotated language material, disagreement between human annotators on the same annotation decisions is in fact expected and taken into account when drawing linguistic conclusions. This factor is known as 'inter-annotator agreement' and it is rendered as a measure that calculates the agreement between the annotators' decisions about a label. The inter-annotator agreement measure is typically a percentage and depends on many factors (e.g., number of annotators, number of categories, type of text); it can therefore vary greatly, but generally speaking, it is never expected to be 100%. Indeed, in the case of linguistic elements whose annotation is highly subjective because it is inseparable from the annotators' culture, personal experiences, values and beliefs—such as the perception of sentiment—this percentage has been found to remain at 60–65% at best (Bobicev and Sokolova 2018).
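
To make the measure concrete, the sketch below computes raw percentage agreement for two hypothetical annotators, together with Cohen's kappa, a common chance-corrected variant; the labels are invented.

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators labelling the same ten sentences for sentiment (invented data).
ann1 = ["pos", "neg", "neu", "pos", "neg", "neu", "pos", "pos", "neg", "neu"]
ann2 = ["pos", "neu", "neu", "pos", "neg", "pos", "pos", "neg", "neg", "neu"]

# Raw percentage agreement: the share of identical decisions.
agreement = sum(a == b for a, b in zip(ann1, ann2)) / len(ann1)

# Cohen's kappa corrects the raw percentage for agreement expected by chance.
kappa = cohen_kappa_score(ann1, ann2)
print(f"agreement: {agreement:.0%}, kappa: {kappa:.2f}")  # agreement: 70%
```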

The post-authentic framework to digital knowledge creation introduces a counter-narrative into the main discourse that oversimplifies automated algorithmic methods such as SA as objective and unproblematic, and it encourages a more honest conversation across fields and in society. It acknowledges and openly addresses the interrelations between the chosen technique and its deep entrenchment in the system that generated it. In the case of SA, it advocates more honesty and transparency when describing how the sentiment categories have been identified, how the classification has been conducted, what the scores actually mean, how the results have been aggregated and so on. At the very least, an acknowledgement of such complexities should be present when using these techniques. For example, rather than describing the results as finite, unquestionable, objective and certain, a post-authentic use of SA incorporates full disclosure of the complexities and ambiguities of the processes involved. This would contribute to ensuring accountability when these analytical systems are used in domains outside of their original conception, when they are implemented to inform centralised decisions that affect citizens and society at large or when they are used to interpret the past or write the future past.

The decision of how to define the *scope* (see for instance Miner 2012) prior to applying SA is a good example of how the post-authentic framework can inform the implementation of these techniques for knowledge creation in the digital. The definition of the scope includes defining problematic concepts of what constitutes a text, a paragraph or a sentence, and how each one of these definitions impacts on the returned output, which in turn impacts on the digitally mediated presentation of knowledge. In other words, in addition to the already noted caveats of applying SA particularly for social empirical research, the post-authentic framework recognises the full range of complexities derived from preparing the material, a process—as I have discussed in Sect. 3.2—made up of countless decisions and judgement calls. The post-authentic framework acknowledges these decisions as always situated, deeply entrenched in internal and external dynamics of interpretation and management which are themselves constructed and biased. For example, when preparing *ChroniclItaly 3.0* for SA, we decided that the *scope* was 'a sentence', which we defined as the portion of text: (1) delimited by punctuation (i.e., full stop, semicolon, colon, exclamation mark, question mark) and (2) containing only the most frequent entities. If, on the one hand, this approach considerably reduced processing time and costs, on the other hand, it may have caused less mentioned entities to be underrepresented. To at least partially overcome this limitation, we used the logarithmic function 2\*log2<sup>5</sup> to obtain a more homogeneous distribution of entities across the different titles, as shown in Fig. 3.5.
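
A sketch of this scoping step, under the two conditions just listed, might look as follows; the splitting rule, the entity matching and the frequency cut-off are simplified assumptions, not the DeXTER code.

```python
import re
from collections import Counter

DELIMITERS = re.compile(r"[.;:!?]")  # the sentence delimiters listed above

def sentences_in_scope(text, entity_mentions, top_n=50):
    """Split the text on punctuation and keep only the sentences that mention
    one of the top_n most frequent entities (top_n is illustrative)."""
    frequent = {e for e, _ in Counter(entity_mentions).most_common(top_n)}
    scoped = []
    for sentence in DELIMITERS.split(text):
        sentence = sentence.strip()
        if sentence and any(entity in sentence for entity in frequent):
            scoped.append(sentence)
    return scoped
```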

As for the implementation of SA itself, due to the lack of suitable SA models for Italian when DeXTER was carried out, we used the *Google Cloud Natural Language Sentiment Analysis*<sup>6</sup> API (Application Programming Interface) within the *Google Cloud Platform Console*,<sup>7</sup> a console of technologies which also includes NLP applications in a wide range of languages. The SA API returned two values: sentiment score and sentiment magnitude. According to the available documentation provided by Google,<sup>8</sup> the sentiment score—which ranges from −1 to 1—indicates the overall emotion polarity of the processed text (e.g., positive, negative, neutral), whereas the sentiment magnitude indicates how much emotional content is present within the document; the latter value is often proportional to the length of the analysed text. The sentiment magnitude ranges from 0 to 1, whereby 0 indicates what Google defines as 'low-emotion content' and 1 indicates 'high-emotion content', regardless of whether the emotion is identified as positive or negative. The magnitude value is meant to help differentiate between low-emotion and mixed-emotion cases, as they would both be scored as neutral by the algorithm. As such, it alleviates the issue of reducing something as vague and subjective as the perception of emotions to three rigid and unproblematised categories. However, the

**Fig. 3.5** Logarithmic distribution of selected entities for SA across titles. Figure taken from Viola and Fiscarelli (2021b)

post-authentic framework recognises that any conclusion based on results derived from SA should acknowledge a degree of inconsistency between the way the categories of positive, negative and neutral emotion have been defined in the training model and the writer's intention in the actual material to which the model is applied. Specifically, the *Google Cloud Natural Language Sentiment Analysis* algorithm differentiates between positive and negative emotion in a document, but it does not specify what is meant by positive or negative. For example, if in the model sentiments such as 'angry' and 'sad' are both categorised as negative emotions regardless of their context, the algorithm will identify either text as negative, not as 'sad' or 'angry', thus adding further ambiguity to the already problematic and non-transparent way in which 'sad' and 'angry' were originally defined and categorised. To marginally deal with this issue, we established a threshold within the sentiment range for defining 'clearly positive' (i.e., >0.3) and 'clearly negative' cases (i.e., <−0.3). The downside of this approach was however that the algorithm considered all the cases between these two values as neutral/mixed-emotion cases, which inevitably led to a flattening of nuances. In Chap. 5, I will return to the ambiguities of SA when discussing the design choices for developing the DeXTER app, the interactive visualisation tool to explore *ChroniclItaly 3.0*, and I will present suggestions towards visualising the complexities and uncertainties in data models and visualisation techniques.
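
A minimal sketch of such a call and of the ±0.3 thresholding, using the current Python client for the API named above; the credentials setup, the language code and the wrapper function are assumptions for illustration.

```python
from google.cloud import language_v1  # requires Google Cloud credentials

client = language_v1.LanguageServiceClient()

def classify(sentence):
    """Return a coarse label plus the raw score and magnitude for one sentence."""
    document = language_v1.Document(
        content=sentence,
        type_=language_v1.Document.Type.PLAIN_TEXT,
        language="it",
    )
    sentiment = client.analyze_sentiment(
        request={"document": document}
    ).document_sentiment
    if sentiment.score > 0.3:      # 'clearly positive'
        label = "positive"
    elif sentiment.score < -0.3:   # 'clearly negative'
        label = "negative"
    else:                          # everything in between is flattened
        label = "neutral/mixed"
    return label, sentiment.score, sentiment.magnitude
```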

The application of the post-authentic framework to SA highlights that the technique is far from being methodologically ideal, and it calls attention to all the uncertainties of using it in fields other than IR and for tasks other than product reviews, as in the use case discussed here. The post-authentic framework acts therefore as a warning against these shortcomings and creates a space for accountability for the adopted curatorial decisions. Within DeXTER and *ChroniclItaly 3.0*, we thoroughly documented such decisions, which can be accessed through the openly available dedicated GitHub repository,<sup>9</sup> which also includes the code, links to the original and processed material, and the files documenting the manual interventions. Ultimately, the post-authentic framework counterbalances the main public discourse—separate from computational research—which promotes SA as an exact way to measure emotions and opinions; it recognises when its use is disconnected from its original purpose, and it accordingly advocates the reworking of the user's epistemological expectations.

In this respect, the implementation of the post-authentic framework for knowledge creation in the digital relates to one of the central pillars of science, that of replicability (or reproducibility/repeatability).<sup>10</sup> The principle postulates that, following a study's detailed descriptions, claims and conclusions obtained by scientists can be verified by others. This is done in the name of transparency, traceability and accountability, which are also fundamental aspects of post-authentic work. The difference however lies in the purpose of these fundamental notions; whereas in science they are primarily aimed at allowing independent confirmation of a study's results, within the post-authentic framework, they are not solely concerned with this specific scientific goal and in fact move beyond it. For example, traditionally, a study is believed to be replicable if sufficient transparency has been observed regarding the data, the research purposes, the method, the conclusions, etc., and yet some studies can be perfectly transparent and not at all replicable (Peels 2019; Viola 2020b). This is, for instance, believed to be the case especially in the humanities, for which the very nature of some studies can make replication impossible, for example, due to a particularly interpretative analysis (Peels 2019).

On the opposite end of the scale, empirical works are believed to be—at least in theory—fully replicable. Thus, despite the still unresolved debate on the 'R-words', over the years, protocols and standards for replication in science have been perfected and systematised. When computers started to be used for experiments and data analysis, however, things became complicated. Plesser (2018), for instance, explains how it became apparent that the canonical margins for experimental error did not quite apply to digital research:

Since digital computers are exact machines, practitioners apparently assumed that results obtained by computer could be trusted, provided that the principal algorithms and methods employed were suitable to the problem at hand. Little attention was paid to the correctness of implementation, potential for error, or variation introduced by system soft- and hardware, and to how difficult it could be to actually reconstruct after some years—or even weeks—how precisely one had performed a computational experiment. (Plesser 2018, 1)

The post-authentic framework is comfortable with the belief that the attainability of complete objectivity (and therefore perfect replicability) is always but an illusion. Indeed, the post-authentic relevance of transparency, traceability and, consequently, accountability lies primarily in the acknowledgement of a collective responsibility, the one that comes with the building of a source of knowledge for current and future generations. Thus, within the post-authentic framework, being transparent about both the 'raw' and the processed material, about the methodology, the analytical processes and the tools assumes a whole new importance: the creation of other digital forms which allow us to trace technical obsolescence, acknowledge power relations and attempt to fluidly incorporate the exchanges that lead to symbiosis, not friction, across interactions. As argued by Fiona Cameron with regard to digital cultural heritage (2021, 12):

[digital cultural heritage] encapsulate[s] other registers of significance, temporality and agency such as planetary technological infrastructures, material agency, non-human, elemental, and earthly processes, all of which are invisible figures in their constitution.

The post-authentic framework for digital knowledge creation recognises that whatever arises out of the confluence of all these different agencies cannot be fully predicted. The role of documentation by researchers, museums, archives, libraries, software developers and so on acts therefore as a means to acknowledge that we are writing the future past and that writing the past means controlling the future. The post-authentic framework provides an architecture to meet the need for accountability to current and future generations.

Finally, the documentation of the interventions has wider resonance, particularly in relation to increasing awareness of sustainability in digital knowledge creation. In June 2020, the UN published the *Roadmap for Digital Cooperation* report, which set out a list of key actions to be achieved by 2030 in order to advance a more equitable digital world. Whilst acknowledging that 'Meaningful participation in today's digital age requires a high-speed broadband connection to the Internet' (United Nations 2020b, 5), the report also highlights that half of the world's population (3.7 billion people) currently does not have access to the Internet. The lack of digital access, also commonly referred to as the 'Digital Divide', affects mostly those located in least developed countries (LDCs), landlocked developing countries (LLDCs) and small island developing states (SIDS), with an even more acute gap in regions such as sub-Saharan Africa, where only 11% have access to household computers and 82% lack Internet access altogether.

The digital inequality worsens the already existing inequalities in society, as those who are the most vulnerable are disproportionately affected by the divide. Based as they are on a universal vision of digital transformation, current digital knowledge creation practices therefore face not only the danger of being available exclusively to half of humanity but also that of yet again imposing Western-centred perspectives on how knowledge is created and accessed. The future looks ever more digital and digitally available repositories will become larger and larger; reconceptualising digital objects within the post-authentic framework means also fostering their reconceptualisation not just in terms of what we are digitising but also *how* and *for whom*. In this sense, the creation, curation, analysis and visualisation of digital objects should, whenever possible, prefer methods and practices that make curatorial workflows sustainable, interoperable and reusable. This should include the storage of the material in an Open Access repository, the use of freely available and fully documented software and a thorough documentation of the implemented steps and interventions, including an explanation of the choices made, which will in turn facilitate research accessibility, transparency and dissemination.

In the next chapter, I will illustrate the third use case of the book, the application of the post-authentic framework to digital analysis. Through the example of topic modelling, I will show how the post-authentic framework can guide a deep understanding of the assemblage of culture and technology in software and help us achieve the interpretative potential of computation. I will specifically discuss the implications for knowledge creation of the transformation of *continuous* material into *discrete* form, binary sequences of 0s and 1s, with particular reference to the notions of causality and correlations. Within this broader discussion, I will then illustrate the example of topic modelling as a computational technique that treats a collection of texts as discrete data, and I will focus on the critical aspects of topic modelling that are highly dependent on the sources: pre-processing, corpus preparation and deciding on the number of topics. The topic modelling example ultimately shows how producing digital knowledge requires sustained engagement with software, in the form of fluid, symbiotic exchanges between processes and sources.


# How Discrete

If you torture the data long enough, it will confess to anything. (Attributed to Ronald H. Coase, 1960)

# 4.1 METAPHORS WITH DESTINY

Metaphors are fascinating and powerful linguistic devices. Over the years, numerous scholars have extensively explored their manipulative talent for creating realities (see for instance, Lakoff 1992, 2004, 2008; Goatly 2007; Mio and Katz 2016). In the context of political discourse alone, for example, the study of metaphors' capacity to hide or popularise latent ideologies, justify or blame governments' decisions, or strategically attribute blame goes back decades (e.g., Musolff 2004, 2010, 2014; Goatly 2007; Ottatti et al. 2014; Viola 2020a). Though extremely powerful—'Metaphors can kill' (Lakoff 1992, 1)—metaphors are neither good nor bad per se; we simply routinely use them, often rather unreflectively, so that abstract and complex ideas can be processed in a cognitively simplified way (ibid.). What makes metaphors so effective, particularly conceptual metaphors, is their use of conceptual frames such as war, disease, sport, family, religion and others which, by evoking mental images that are familiar to the message receivers, can turn complex concepts into a simple, linear logic (Viola 2020a). It is thanks to this 'framing power' that metaphors' arguments become plausible and the proposed conclusions are perceived as unproblematic and even 'self-evident' (Musolff 2016, 133). Moreover, as we mostly use metaphors implicitly, such framing power remains typically unnoticed and so do the metaphors themselves. So, for example, in the context of the COVID-19 pandemic, when commenting on the effectiveness of Italy's decision to institute a national lockdown, the French Prime Minister at the time, Édouard Philippe, said, 'To block the country does not allow us to contain the epidemic'<sup>1</sup> (Valeurs actuelles 2020). At the time the comment was made, France was adopting much less drastic measures than Italy; the differences in the two countries' crisis management approaches therefore needed to be justified, and in order to be accepted by the nation, the domestic strategy had to be presented to the public as the best possible solution (Viola 2022). In this particular example, the framing power is conveyed by the expression *to block the country*: the metaphorical use of the verb *to block* frames the Italian lockdown measure not only as overly aggressive but also as wrongly targeted: it is the country that is put to a halt, not the spread of the virus.

But metaphors are not typically found just in political discourse; scientific discourse also regularly exploits the power of metaphors to simplify complex concepts. In 2003, Blei et al. (2003) published a study which, at the moment of writing, counts 36,483 citations. The paper tackled the task of modelling a collection of discrete data, for example, a corpus of texts, for efficient processing tasks such as classification and content summarisation. The authors' basic idea was to model each item in the collection, e.g., each text, according to the Latent Dirichlet Allocation (LDA) model, a generative probabilistic model in which documents are represented as distributions of sets of words statistically likely to occur together. Although the article itself was titled 'Latent Dirichlet Allocation', the technique described in the article went down in history as *topic modelling*. The reason for that may be found in the fact that the authors had decided to name the above-mentioned sets of words 'topics', although their intention was not to make epistemological claims regarding the latent variables but simply to 'exploit text-oriented intuitions' (996), that is, to take advantage of a familiar image such as that of *topics*. In other words, the term *topic* was used metaphorically.
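
For readers unfamiliar with the technique, the sketch below shows the core of what LDA actually returns: probability distributions over a vocabulary, not 'topics' in any everyday sense. The toy corpus and the choice of two topics are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the ship arrived in the harbour with emigrants",
    "parliament debated the new emigration law",
    "the harbour authority inspected the ship",
    "the law was approved by parliament",
]

# Each document becomes a vector of word counts: the 'discrete data' of the paper.
vectoriser = CountVectorizer(stop_words="english")
counts = vectoriser.fit_transform(docs)

# Fit an LDA model with two latent 'topics'; the number of topics is a choice
# the analyst must make, not something the model discovers by itself.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# A 'topic' is nothing more than a weighting over the vocabulary.
words = vectoriser.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [words[j] for j in weights.argsort()[::-1][:4]]
    print(f"topic {i}: {top}")
```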

A similar observation about the metaphorical use of everyday notions to refer to techniques which are in fact based on specific, rather different, principles may also apply to computational techniques such as '*sentiment* analysis' and 'machine *learning*'. The metaphorical use of the terms 'sentiment', 'learning' and 'topic' may be harmless within the fields that have devised such techniques, because the principles upon which they are based are very clearly defined by their creators and understood in those circles. It may on the contrary have huge consequences when these methods are passively transferred into other disciplines or practices. In his analysis of informational approaches in cancer biology research, Longo (2018), for example, critiques the extensive use of computer science terminology such as 'instructions', 'to reprogram a deprogrammed DNA' and, in general, the description of DNA as a computer program and of genes as information carriers. He argues (88):

The informational approach in biology conflates the concept of programming on discrete data with the common-sense understanding of 'information' and 'computer program', which are vaguely familiar to everybody [...] In fact, the use of 'information' and 'programming' in biology is not scientific because it neither applies the mathematical invariants proper to information and programming, nor the theorems proper to the corresponding scientific disciplines. Instead, it transfers a vague, everyday notion and refers to 'weak' meanings.

Longo argues that the metaphorical use of mathematical and computational language has had enormous consequences for molecular biology cancer research, which essentially studies cancer as the result of DNA deprogramming, inherited or otherwise caused by a carcinogen that disrupts the DNA 'encoded instructions' (92). The use of an everyday notion such as that of 'program', he continues, has also no doubt facilitated understanding among funding agencies and the public, perhaps even leading to the exclusion of alternative hypotheses. Similarly, one might argue that it is the metaphorical use of the word *topic* that explains why topic modelling has become so popular beyond computer science and in the humanities in particular: whereas not everyone may be an expert in statistical modelling, we are all more or less familiar with a fairly general conceptualisation of what a topic is. However, what humanities scholars may have not been too familiar with—and to a large extent, still aren't—is the set of assumptions behind a method born in the computer sciences and adopted in critical research.

The popularity of topic modelling beyond computer science (as well as that of SA and ML) is closely related to another phenomenon, well known in linguistics: when a metaphor is adopted by a significant part of the linguistic community, language users may no longer be aware of its metaphorical use; the metaphor becomes a common meaning and so it dies (Ricœur 2003, 115). The metaphorical use of 'sentiment', 'learning' and 'topic', I will argue here, has certainly contributed to making these techniques very popular outside of their field of origin. At the same time, however, precisely because of this popularity, these meanings have become common meanings, i.e., 'dead metaphors'. This in turn has major consequences: the creation of epistemological expectations that these methods will inevitably disappoint (Puschmann and Powell 2018). For example, as I have discussed in Chap. 3 in reference to SA, the familiar word 'sentiment' creates a specific epistemological expectation: that it is somehow possible to obtain a neutral way to assess attitudes and moods in large quantities of material. Assessment, however, requires language understanding as a prerequisite and, when it comes to machines, this is exactly what they are unable to do. The post-authentic framework that I advance in this book serves also as a reminder that these terms are used as mere metaphors.

In the next section, I will discuss a more concerning aspect concealed by the use of vague, familiar notions such as 'sentiment', 'learning' and 'topic': the underlying process upon which these techniques are based, i.e., the elaboration of continuous information into discrete systems, and the implications for causality. In discrete systems, causality is hidden because information is rendered as exact and separate points, all encoded in one dimension and according to precise instructions (Longo 2018). The three-dimensional, causal essence of information cannot be accessed by the user, who is instead offered an altered image made up of predictions of correlations. The resulting information will still refer to its original continuous structure, but computers will only render it as a sequence of 0s and 1s, that is, in discrete form, thus hiding relational causality.

In the case of SA, this distorted image is reflected in the reduction of the subjectivity of human emotions to two or three categories, scored according to probabilistic calculations; in the case of ML, the holistic, human capacity to acquire knowledge and skills through experience, logic and contextual factors is reduced to the probabilistic processing of huge, yet partial, quantities of discrete data; in the case of topic modelling, the text itself disappears and so does its entrenchment in the wider context that produced it. In all these cases, the three-dimensional, causal structure is no longer accessible, nor is its historical and social susceptibility, as it is all dissembled by the computational, dualistic system of 0s and 1s. The conflation of discrete data modelling with familiar notions such as 'sentiment', 'learning' and 'topic' has therefore certainly contributed to making these methods extremely popular outside their fields of origin, but at the same time, it has obfuscated the well-defined laws upon which they are based. Longo claims:

This is an amazing technological achievement: by fine engineering, one may forget the underlying physical hardware and its continuous flows and just consider (and work on) the discrete software processes by writing alphanumeric programs. (Longo 2018, 87)

In a world where all information is digital, the consequence of this amazing technological achievement is that it also presents a distorted image of knowledge because, to paraphrase David Tong, the world does not seem to be discrete (Tong 2011).

In this chapter, I first examine the implications of adopting discrete methods and technologies not just as quantitative tools in the humanities but for knowledge production in general and, more widely, for our understanding of society. Specifically, I reflect on the notions of causality and correlations in light of the considerations discussed so far about the mythicised discourse on data and technology neutrality, the dangers of using metaphorical language to refer to digital technologies and the consequent urgent need for a knowledge reconfiguration inspired by symbiosis and mutualism. I then proceed to examine the text mining technique of topic modelling and the premises on which it is based, with a special focus on its use of discrete mathematics to encode information. Finally, I illustrate how applying the post-authentic framework to topic modelling can facilitate critical engagement with this technique, especially in humanities research.

In my discussion, I argue that such engagement can only happen by maintaining a sustained connection with the digital object, and I demonstrate how the application of key post-authentic concepts and methods can be especially effective at three decisive stages in a topic modelling workflow: pre-processing, corpus preparation and choosing the number of topics. The post-authentic framework, as the analysis will show, may be especially effective at prompting the active and reflexive participation of the user in the process of knowledge production in the digital. In the next section, I start my argument by discussing the implications of the 'big data philosophy', that is, the reliance on patterns and correlations, as opposed to causation, to explain phenomena; I also examine such implications in relation to topic modelling and its use for knowledge creation, in humanistic enquiry and beyond.

# 4.2 CAUSALITY, CORRELATIONS, PATTERNS

Perhaps one of the most significant implications of the 'Digital Turn' in the humanities, more widely in the natural, computational and social sciences, and more widely still in relation to the digitisation of society is contained in the notion of discrete vs continuous modelling of information. The concepts of discrete and continuous and the tension between the two are at the foundation of mathematical thought and of how mathematical modelling is used to explain natural phenomena (Fenstad 1985). A way to understand the crucial difference between discrete and continuous structures is to consider that in a discrete structure, all points are isolated and completely disconnected from each other; one can therefore label and count them, and their count is exact and absolute. On the contrary, one can only access a continuous structure by measuring it, and these measurements create intervals or fractions of intervals; moreover, in the continuous, a scale for the measurement has to be set (Longo 2018, 84). Therefore, in discrete systems, there is no room for approximation, no uncertainty, no nuances, as something is either one point or another, whereas in the continuous—since phenomena can only be accessed by measuring them—the measurements are always approximated (Longo 2019, 64–65).

Even without going too deep into the full mathematical (and physical!) ramifications of these two notions, one can intuitively understand that they refer to very different ways of mathematical thinking. A fundamental difference particularly relevant to the arguments advanced in this book concerns the understanding of causality, a notion whose theoretical conceptualisation from philosophy to physics can be traced back to antiquity.<sup>2</sup> For the sake of the argument advanced in this chapter, I will summarise the discussion by saying that in the classical worldview, which prevailed until the twentieth century, a mechanistic notion strongly identified causation with determinism. Determinism can be understood as the ability to determine the future state of a physical system from its present state (Weinert 2005, 196). According to this view, also known as the functional view of causation, every event has a unique cause that precedes it (de Laplace 1820; Stigler 1986; Čapek 1961), and therefore the world is seen as an 'uninterrupted chain of causes and effects' (Holbach 1770). This view has been criticised over the course of the twentieth century for several shortcomings, such as the proximity of elements in determining cause-effect relationships, predictability as the main criterion for establishing causation and the reduction of causality essentially to a mere temporal relationship. Discoveries of and advances in differential equations, atomic physics and quantum mechanics have further consolidated such criticisms, eventually leading to the current separation of causality from determinism. Particularly in quantum mechanics, recent experiments have provided strong evidence for the validity of this notion of causality without determinism. In this view, consequent states of a quantum system are related to its antecedent states by a form of conditional dependency (Weinert 2005, 241), as opposed to every event having a unique cause that precedes it.

Coming back to the distinction between discrete and continuous structures, this means that in discrete systems, there is no deterministic cause-effect relationship, because points are totally separated from each other, whereas in continuous systems, causal relations can be observed and measured, but not predicted<sup>3</sup> (Longo 2018, 86). Though it may appear inconsequential at first, this observation about causality has specific and profound implications that stretch well beyond mathematical and physical reasoning. Stating that in a discrete structure such as, say, a database, where something belongs to either one category or another, no cause-effect relationship between observed phenomena can be established, only a probabilistic one, essentially means that explanations for such phenomena cannot be found, only correlations. If two random variables are correlated, or as noted by Calude and Longo (2017), *co-related*, it means that they are associated according to a statistical measure, that they co-occur. This statistical measure is rendered by a correlation coefficient, a number between −1 and 1 that expresses the strength of the linear relationship between two numeric variables. If two variables are positively correlated (e.g., they both increase), the correlation coefficient will be closer to 1; if there is a negative correlation (i.e., they are inversely correlated), it will be closer to −1; and if there is no correlation at all, it will be closer to 0. It is a well-established fact in statistics and beyond that a correlation coefficient per se is not enough to explain the cause of the patterns that are captured.<sup>4</sup>
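
As a concrete illustration, the following minimal Python sketch computes a Pearson correlation coefficient with NumPy; the two variables and their values are invented purely for illustration. Note what the number does and does not deliver: it expresses the strength of a linear association, and says nothing about why the two variables move together.

```python
import numpy as np

# Two toy numeric variables (invented values).
temperature = np.array([10, 14, 18, 22, 26, 30])
ice_cream_sales = np.array([120, 135, 160, 180, 210, 260])

# np.corrcoef returns the correlation matrix; the off-diagonal entry
# is the Pearson coefficient, a number between -1 and 1.
r = np.corrcoef(temperature, ice_cream_sales)[0, 1]
print(f"correlation coefficient: {r:.2f}")  # close to 1: strong positive co-relation

# The coefficient alone cannot tell us whether heat drives sales,
# sales drive heat, or a third factor drives both.
```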

The identification of statistical correlations is nevertheless an important factor in understanding the relationship between two quantitative variables and it remains an insightful method that can potentially lead to significant discoveries. Indeed, the observation of correlations is at the foundation of the classic scientific method in the sense that, starting from the measurement of correlated phenomena, scientists have been able to formulate theories that could be tested and later confirmed or disproved. The history of science is full of extraordinary achievements which originated from mere observations of not-so-obviously correlated phenomena, for example, distributional semantics theory, a famous linguistic theory that stemmed from the intuition of Zellig S. Harris and John R. Firth, two semanticists (though Harris was also a statistical mathematician). This intuition—famously captured by Firth's quote 'You shall know a word by the company it keeps' (1957, 11)—acknowledges the relevance of words' collocation (i.e., the place of occurrence of words) in determining their meaning. The core idea behind Harris and Firth's work on collocational meaning and distributional semantics is that meanings do not exist in isolation; rather, words that are used and occur in the same contexts tend to purport similar meanings (Harris 1954, 156).

In those days, gaining access to real language data was costly and very time-consuming and for a long time, it was not possible to test this theory. But more recently, new advances in computer science merged with huge quantities of naturally occurring language material, including digitised historical data-sets, have indeed proven that languages are not deterministic systems—as previously believed—but that they should be thought of as 'probabilistic, analogical, preferential systems' (Hanks 2013, 310). As intuitively theorised in distributional semantics, words do not have a one-to-one relationship with meaning because meanings are not precise, exact or stable. On the contrary, words in isolation do not possess any meaning and meanings can only be entailed from words' context. As argued by Harris, 'We cannot say that each morpheme or word has a single or central meaning, or even that it has a continuous or coherent range of meanings' (Harris 1954, 151). Sixty years after its initial formulation, distributional semantics theory laid the basis for Google's renowned word2vec algorithm, and today, it constitutes the theoretical background of NLP studies concerned with language and meaning, including topic modelling itself (*cfr.* Sect. 4.4).
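
To make the distributional intuition tangible, here is a minimal sketch using Gensim's Word2Vec on an invented toy corpus; the corpus is far too small to yield reliable vectors, so it illustrates only the mechanics: words that keep similar company receive similar vectors.

```python
from gensim.models import Word2Vec

# Toy tokenised 'documents' (invented); real use requires millions of words.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["the", "cat", "chased", "the", "mouse"],
    ["the", "dog", "chased", "the", "ball"],
]

# Each word is mapped to a dense vector learned from its contexts.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1,
                 epochs=200, seed=1)

# 'cat' and 'dog' share contexts ('sat', 'chased'), so their vectors
# should be comparatively close.
print(model.wv.similarity("cat", "dog"))
```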

Coming back full circle to causality, correlations and patterns: a correlation measure only informs us of the strength of a relationship between two variables, whereas patterns tell us that certain regularities can be found in how the observed variables are distributed. Hence, because they highlight trends in the data, correlations and patterns may potentially have predictive power, but neither of them provides causal explanations for the analysed phenomena, nor do they intrinsically carry significance. In the next section, I will elaborate on these reflections to discuss the important implications for society of operating predominantly within the discrete system of the contemporary encoding of all digital information, binary sequences of 0s and 1s. Taking the example of the analysis of material that had originally been conceived of as a coherent entity, i.e., continuous (e.g., a book, a collection of essays on the same topic, all the issues of a newspaper), I explore the implications of its digital encoding into discrete form through digitisation and subsequent digital analysis. One critical implication, I argue, is that the adoption of an indiscriminate, data-driven approach to analysis risks completely disregarding context and attributing meaning to correlations and patterns per se. Through the example of topic modelling and its application to the analysis of *ChroniclItaly 3.0*, further in the chapter, I show how the application of concepts and methods of the post-authentic framework to digital knowledge creation can be useful to prompt a critical stance towards computational methods and tools, which I argue is urgently needed for the configuration of a model for knowledge production in the digital.

# 4.3 MANY PATTERNS, FEW MEANINGS

Big data analytics (*cfr.* Sect. 1.2) is supported by the idea that correlations are expected to be recurrent, i.e., they will iterate similarly along the chosen parameter, for example, time (Calude and Longo 2017, 602). Recurrent correlations are an established scientific principle and they can be observed in natural cycles such as the water cycle and the alternation of seasons. The recurrence of correlations suits deterministic systems well, in which it is believed that one can determine the future state of a physical system from its present state (*cfr.* Sect. 4.2). This is precisely what the 'big data philosophy' states: because patterns are expected to be recurrent, the future can be predicted by statistical algorithms based on the patterns found in past data, without the need for causal explanation. Naturally, so the philosophy goes, the larger the data-set, the more accurate the prediction.

This idea that all that counts are the patterns is not in fact new; it can be traced back to the 1990s and to Complexity Theory (Waldrop 1992). Complexity Theory argues that there is a hidden order to the behaviour and evolution of complex systems and that chaos can be made manageable by looking at its underlying, ubiquitous patterns. What these patterns show is how complex systems work, more specifically how organisations cope with uncertainty and nonlinearity and manage to remain stable. The idea behind Complexity Theory is that complex systems are the assemblage of extremely convoluted factors which make them fundamentally unpredictable. Yet, at the same time, complex systems exhibit order: rules according to which independent actors, i.e., discrete elements, spontaneously self-organise. This contradictory property makes it possible for patterned behaviour and properties to be observed. It also means, however, that the meaning of any system is irrelevant, as the focus is and remains on the observed behavioural patterns.

One does not have to dig too deep to see how computer science has strongly supported Complexity Theory. Indeed, Complexity Theory fits perfectly with what machines excel at: finding patterns in the data (Turkle 2014). Ever more powerful computers can be given enormous quantities of data and instructed to find the patterns that human beings will never be able to find. And it works. Patterns are always found. However, despite appearing (at first, at least) logically sound and despite being validated by the cycles present in nature, the discourse surrounding big data analytics obscures at least four fundamental truths. Firstly, as said earlier in the chapter, in discrete systems such as a database, no cause-effect relationship between observed phenomena can be established, only correlations and patterns. Computers are not programmed to find meanings, only the patterns; as correlations and patterns do not intrinsically carry significance, this essentially means that databases provide an a-causal image of the world (Longo 2018, 86). Thus, what the big data hype obscures is that today's computer-dominated world offers us countless patterns but no explanations for them, and so we are left to deal with a patterned, yet acausal, way of making sense of reality.

Secondly, the idea that information is uniquely absorbed from data is also closely related to Complexity Theory. The theory argues that complex systems are constantly altered by agents' interactions through a process of feedback loops; thanks to their intrinsic capacity to learn from experience, complex adaptive systems are organic and better able to evolve. The big data approach has essentially adopted this theory in toto, but it seems to have failed to recognise that machines are in fact incapable of *learning*. Indeed, the deterministic belief that the future state of a physical system can be predicted from the observation of its past state, which in any case has been criticised over the course of the twentieth century and mostly disproved (as discussed in Sect. 4.2), has become conflated with the metaphorical use of the word 'learning' in ML. The familiar notion of 'learning' confounds what learning actually means for a machine—finding correlations and patterns but no causal explanations—with the human capacity to understand and make sense of the world, i.e., attempting to find causality.

Thirdly, the deterministic claim of big data analytics that, based on available data, one can provide accurate predictions of the future without the need for causal explanation is provably wrong. Calude and Longo (2017) demonstrated that in a large enough data-set, there will always be correlations, but most of them will be random, i.e., meaningless. This means that the probability that a series of correlations will be recurrent, as in natural cycles, is extremely low; the authors explain: 'recurrence may occur, but only for immense values of the intended parameters and, thus, an immense database' (ibid., 609). In other words, the patterns found in databases do not per se constitute sufficient proof to offer reliable predictions of the future because most of these patterns will actually be false positives. In techniques such as topic modelling, an element of randomness is in fact built into the algorithm itself as, initially, documents are assigned to topics at random. Although it is true that the calculations become increasingly accurate as the algorithm iterates through more documents, the risk once again is to see meaning where there is none.

Fourthly, the fact that databases are exact, i.e., discrete, perpetuates the false belief that data is also exact, neutral and objective. The 'big data philosophy' always emphasises that statistical algorithms will find patterns where nobody else can, and that because databases are exact, this is enough. What is, on the contrary, not at all emphasised is the subjective and interpretative dimension of collecting, selecting, categorising and aggregating, in other words, of *making* data. Recognising that data is created makes the claims of absolute impartiality, exactness and reliability shaky at best and ethically concerning at worst, particularly when necessarily incomplete, biased and opaquely collected data is used to make predictions that influence decision-making processes or produce research findings.

Reassuringly, these limitations have recently started to be at the centre of the academic debate and have originated the so-called causal inference challenge. In their work *The Book of Why* (2018), computer scientist Judea Pearl and mathematician Dana Mackenzie argue that these limitations make the big data philosophy inadequate to solve our world's challenges. They note that as current ML solutions cannot find the causal relations between patterns, they inevitably fail to generalise beyond the domain of examples present in a given data-set, which most of the time will include synthetic data (as opposed to real-world generated data). In other words, most current ML methods tend to 'overfit the data', meaning that 'they try to learn the past perfectly, instead of uncovering the real/causal relationships that will continue to hold over time' (Gonfalonieri 2020). New avenues in this direction are increasingly being explored and have resulted in new emerging fields such as causal machine learning (see for instance, Pearl et al. 2016; Shanmugam 2018; Hernán and Robins 2021). However, although interest in this topic has grown exponentially in the span of only a few years, methods and applications are still at an experimental stage and, to my knowledge, primarily limited to academic research.

# 4.4 THE PROBLEM WITH TOPIC MODELLING

The topic modelling algorithm essentially formalises distributional semantics theory (*cfr.* Sect. 4.2). However, whereas the focus of distributional semantics theory is on the meaning of a single word, topic modelling tries to capture the overall meaning of clusters of words that appear together (i.e., that are correlated) in a document. Put differently, just as single words do not possess any meaning but meanings can only be entailed from their context, topic modelling assumes that groups of words also purport collective meanings, i.e., *topics*. This all sounds very logical, but there is a caveat. Similar to quantum, computational and genetic systems, languages are discrete representations (i.e., outputs) of fundamentally continuous structures (i.e., inputs). This property—called the discrete infinity of language—essentially means unlimited productivity from limited means (Chomsky and Smith 2000). It describes the ability of languages to create an infinite variety of expressions of thought from a limited set of discrete elements (Studdert-Kennedy and Goldstein 2003). The discrete infinity of language necessarily entails that languages are intrinsically ambiguous because meaning is context-bound; significantly, it indicates that different contexts shape the creation of infinite meanings. The problem with topic modelling is that it provides a probabilistic representation of words' distributions in the ingested documents, but it is completely agnostic of the underlying continuous structure of such documents: the ambiguity of words' use in each document and across texts, as well as the documents' coherent substructure, let alone their wider historical, social and cultural entrenchment.

As said earlier in the chapter, topic modelling provides a probabilistic representation of how words are distributed in documents according to statistical calculations, that is, correlations. This means that words are considered to be discrete elements; for example, in the corpus preparation stage (*cfr.* Sect. 4.5.2), words are transformed into numeric variables and their distribution across documents is represented as a distribution matrix. What topic modelling then does is measure the strength of the linear relationship between these numeric variables. But topic modelling also treats the corpus itself as a collection of discrete data, which means that each text is also processed as a separate entity, totally disconnected from all the other texts in the batch. This is true regardless of whether the input is all the chapters from the same book, all the issues of a newspaper or all the abstracts ever submitted to an academic journal under the keyword tag 'topic modelling'. In other words, it is a computational technique that efficiently identifies patterns of words' distribution, but because it lacks access to the words' underlying continuous structure—the infinity of language—no cause-effect relationship between the correlated phenomena, i.e., the meaning of such patterns, can be established.

Another issue with topic modelling is that it assumes that an a priori fixed number of topics—which in any case is decided more or less arbitrarily—is represented in different proportions in *all* the documents. Hence, if the algorithm is instructed to find X number of topics, it will build a model that fits that number. This assumption behind the technique cannot but paint a rather artificial and non-exhaustive picture of the documents' content, as it is hard to imagine how, in reality, a fixed number of topics could adequately represent the actual content of all the analysed documents. Thus, correlations will surely be identified, but not all these correlations will necessarily carry significance, that is, meaning. Moreover, as countless parameters can be tweaked, the smallest change will output a different model, in which different correlations will be found and many others will be missing. Indeed, even when the same parameters from the same software are used on the same data-set, the algorithm will output a slightly different model, which proves once again that patterns will always be identified, regardless of their significance. I will return to this point in Sect. 4.5.3.
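
This instability is easy to observe directly. Below is a minimal sketch, assuming the Gensim library and an invented toy corpus: training LDA twice with identical parameters but different random seeds generally yields differently composed topics.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    ["vote", "election", "senate", "law"],
    ["harvest", "wheat", "farm", "rain"],
    ["election", "campaign", "vote", "speech"],
    ["rain", "crop", "farm", "season"],
]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Same data, same parameters, different seeds: the resulting word
# clusters (and their ordering) will generally differ.
for seed in (1, 2):
    lda = LdaModel(corpus, id2word=dictionary, num_topics=2,
                   random_state=seed, passes=10)
    print(f"seed {seed}:", lda.print_topics())
```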

# 4.5 ANALYSIS OF DIGITAL OBJECTS: A POST-AUTHENTIC APPROACH TO TOPIC MODELLING

The post-authentic framework to digital knowledge creation contributes to the urgent need for the establishment of critical data and visualisation literacy in the current landscape—both public and academic—in which computational techniques and outputs are predominantly framed as, and often believed to be, exact, final, objective and true. Whilst exploiting the new opportunities offered by computational technologies, the post-authentic framework rejects an uncritical adoption of digital methods, and it promotes a model not simplistically oriented towards problem-solving, solution automation and sleek interface designs but towards encouraging critical engagement and active participation. This ultimately means recognising that knowledge is fluid and that the complex challenges we face today therefore require a model of knowledge production that fosters symbiotic collaborations, fluid exchanges and mutualistic contributions, as opposed to hierarchical separation and competition.

As an example of how the application of the post-authentic framework can contribute towards fluid processes of knowledge creation in the digital, including the need for a less naïve conceptualisation of computational techniques, digital objects and methods, I discuss here the third use case of the book: analysis of digital objects. The example of topic modelling demonstrates how critical engagement with computational techniques is urgently required to meet the uncertain and problematic aspects of digital research. For example, in fields such as DH, in which this technique is used extensively, a recent survey on LDA topic modelling (Du 2019) found that 74% of the surveyed studies did not report how their corpora were prepared, more than 70% did not report which tool was used to train their topic models, almost 57% did not report how many topics were trained and about 90.5% did not report how their topic models were evaluated.

DH is not at all an isolated case, however. Though with some differences, a similar trend has also been found in software engineering research (Silva et al. 2021), where topic modelling is widely used to analyse online conversations among developers or to improve software engineering tasks such as source code comprehension. From the analysis of 111 relevant papers, Silva et al. (2021) found both general inconsistency and the adoption of opaque methods in topic modelling practices, on the whole pointing to a degree of uncertainty about the specificity of the technique itself. The highest inconsistency was found with reference to tasks such as choosing the number of topics, naming the topics and evaluating the topics' semantic interpretability. The authors attributed the lack of specificity of the technique to the fact that the majority of the surveyed papers had employed LDA 'as is', that is, they had adopted the default parameters as if it were off-the-shelf software. This approach, however, is generally not encouraged; computer scientists openly acknowledge that finding the meaning behind the identified patterns is highly dependent on the specifics of the sources because, as argued by Hindle et al. (2015, 510), 'LDA does not look for the same patterns that people do'.

In this part of the chapter, I illustrate how the post-authentic framework can be applied to topic modelling to guide a more mindful understanding of the materiality of the sources. To this end, I deliberately choose cultural heritage material, sources that are inevitably problematic from a computational point of view. I then focus on the key aspects of topic modelling that are highly dependent on the sources and which, in my experience, have the most significant impact on the results: pre-processing, corpus preparation and deciding the number of topics. As a case example, I use the already discussed Italian American newspapers as collected in *ChroniclItaly 3.0* (*cfr.* Chaps. 2 and 3); my aim is to emphasise how preparing the material for the analysis is part of the analysis itself. My discussion demonstrates how, far from being fully automated, neutral and objective, the analysis of a digital object requires the analyst to make countless decisions, which are nonetheless different from the ones required when preparing the material for enrichment, even when the same sources are used. Indeed, engagement with the technique starts much earlier than the algorithm's implementation stage, which in any case should also not be performed as a fully automatic operation. The application of the post-authentic framework allows me to evidence that LDA may well be an unsupervised technique, but this simply means that it works with unstructured data,<sup>5</sup> not that, despite what may be generally believed, it requires no human intervention.

#### *4.5.1 Pre-processing*

In Chap. 3, I illustrated how pre-processing operations are far from standard and how each intervention must in fact be carefully assessed by scholars and practitioners and evaluated on a case-by-case basis. In my discussion, I considered the many influential factors at play (e.g., the materiality of the source, the specific task to be performed, the available resources, both economic and technical) and illustrated how they in turn are embedded in a complex, wide net of co-dependent actors, elements and circumstances which have influenced each other and will in turn influence current and future interventions. The same considerations apply to the analysis of a digital object; this, I maintain, requires a high level of critical engagement with the chosen method well before the algorithm's implementation stage. In the case of topic modelling, for example, which takes as its input unstructured data, e.g., plain text, the first thing one needs to decide is the *scope* (*cfr.* Sect. 3.4), that is, what to consider as *documents* (i.e., the input) (see for instance, Miner 2012). Topic modelling aims to represent documents as probabilistic distributions of words; hence, in a book, the documents could be the book's pages, whereas on a newspaper's page, they could be the individual articles, and so on. Conceptually, it intuitively makes a difference to search for the topics in a chapter vs the topics in each page of that chapter. But this is also an important decision from a pragmatic point of view: as topic modelling is essentially a statistical method, the length of each modelled item, i.e., the document, does matter. And yet, although this is a rather determining factor, studies using this method rarely specify how the criteria for deciding the scope of the documents are assessed and, even when mentioned, they are referred to vaguely. In Silva et al.'s survey of topic modelling in software engineering research (2021), for example, the authors found that 86% of studies did not mention such criteria at all, nor did they acknowledge documents' length as an important factor; they also found that even when the relevance of the vocabulary size was acknowledged (14%), about half (7.4%) did not specify the selection criteria or the documents' length.

In the case of *ChroniclItaly 3.0*, I considered that each file in the collection corresponds to the first page of each issue published by the newspapers on a certain date. This structure mirrors the way the collection was digitised by the Library of Congress, evidencing once more the inseparable complexity of relations between digital material and its wider entrenchment in the surrounding digital infrastructure that created it and/or provides it. Therefore, I defined as *documents* each file/issue as it was in the collection; this decision had the dual advantage of modelling the documents according to the events narrated on a day/issue basis while following the Library of Congress metadata schema.

In terms of preparatory operations such as removing stopwords, lowercasing and removing punctuation, numbers and special characters (*cfr.* Chap. 3), for the specific task of topic modelling, additional linguistic decisions must also be evaluated; here I discuss stemming and lemmatisation. Although both aim to obtain a word root by reducing the inflection in words, these operations are built on very different assumptions. Stemming deletes the initial or final characters in a token based on a list of common prefixes and suffixes that may typically occur in the inflected words of a language (e.g., states → state). It is therefore language-dependent, as it relies on a limited set of cases which apply exclusively to languages that follow specific inflection rules. For languages with fairly regular inflection rules, such as English, stemming may work reasonably well, but applied to highly inflectional languages such as Italian, with its many exceptions and irregularities, the algorithm would almost certainly perform poorly. Another strong limitation of stemming is that in many cases—including low-inflectional languages—the output is not an actual word, meaning that the operation is likely to introduce new errors. On the other hand, as it is not a particularly advanced technique, stemming does not require a long processing time or much processing power, and therefore this solution may be implemented when working with particularly large corpora or when constrained by time limitations.
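
The difference can be seen in a minimal sketch using NLTK's Snowball stemmers (the example words are illustrative): the English stemmer handles a regular plural well, while the Italian stemmer, relying on suffix-stripping rules, frequently returns truncated forms that are not actual words.

```python
from nltk.stem.snowball import SnowballStemmer

stem_en = SnowballStemmer("english")
stem_it = SnowballStemmer("italian")

print(stem_en.stem("states"))    # regular English plural: 'state'

# Italian suffix-stripping often yields truncated, non-word stems,
# and irregular forms fare worst.
print(stem_it.stem("migliore"))  # 'better'
print(stem_it.stem("andiamo"))   # 'we go'
```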

Lemmatising is, on the contrary, a much more sophisticated technique, as it is based on more solid linguistic principles than stemming. By means of detailed dictionaries that contain lemmas and by examining words' context, a lemmatising algorithm analyses the morphology of each word and then transforms it into its grammatical root (e.g., better → good). Especially in the case of topic modelling, in which the output is essentially a list of words without any context, lemmatising can be very helpful to distinguish between homonyms, words that have the same spelling, and sometimes the same pronunciation, but which in fact possess different meanings. For example, the word *mento* in Italian can mean either 'chin' or 'I lie'. A lemmatising algorithm would theoretically be able to infer the use of *mento* from its context and distinguish it from its homonym; in this case, the different outputs would be *mento* (i.e., chin) for the former and *mentire* (i.e., to lie) for the latter. Because of its complexity, however, lemmatising may require a long time and very high processing power, and so in the case of large collections, or depending on the available means and resources, it may not be ideal. Additionally, if on the one hand lemmatising is effective at differentiating between homonyms, on the other the reduction of all inflected words to their lemma may cause information loss. For instance, it would no longer be possible to recognise the tense (present, past, future) or the grammatical person (I, they, you, etc.) of verbs, the gender or number of nouns, the degree of adjectives (e.g., superlative, comparative), etc.
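
A minimal lemmatisation sketch with spaCy, assuming the small Italian pipeline `it_core_news_sm` has been installed beforehand; whether the homonym is actually disambiguated depends on the pipeline and cannot be guaranteed.

```python
import spacy

# Assumes: python -m spacy download it_core_news_sm
nlp = spacy.load("it_core_news_sm")

for sentence in ("Il suo mento trema.", "Io non mento mai."):
    doc = nlp(sentence)
    print([(token.text, token.lemma_) for token in doc])

# Ideally, 'mento' keeps the lemma 'mento' (chin) in the first sentence
# and is lemmatised to 'mentire' (to lie) in the second; in practice,
# the disambiguation is only as good as the model.
```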

Whether this type of information is relevant depends once again on several factors such as the type of data-set (e.g., size, content), the context of the digital analysis, the language of the data-set and the specific research question(s); researchers should therefore carefully evaluate the pros and cons of implementing this operation. For example, in researching narratives of migration as they were told by Italian American migrants, the cons of implementing either stemming or lemmatising would in my opinion exceed the pros. Italian is a highly inflectional language and a great deal of linguistic information is encoded in suffixes and prefixes; stemming therefore suits it poorly. Similarly, lemmatising the corpus would cause the loss of information encoded in inflected words (e.g., verbs expressed in the first person, collective concepts expressed by plural nouns) which could bring valuable insights into the cognitive, subjective dimension of the stories told by the migrants.

Finally, whether to perform either of these operations is very much dependent on the language of the data-set, not just because different languages have different inflection rules, but crucially also because not all languages are equally resourced digitally. Indeed, as discussed in Sect. 2.2, the digital consequence of the fact that most mass digitisation projects have been carried out in the United States and later in Europe is that computational resources available for languages other than English continue to remain on the whole scarce. Such Anglophone-centricity is often still a barrier to researchers, teachers and curators whose sources are in languages other than English. Indeed, the comparative lack of computational resources in other languages often dictates which tasks can be performed, with which tools and through which platforms (Viola and Fiscarelli 2021b). Moreover, even when adaptations for other languages may be possible, identifying which changes should be implemented and, perhaps more importantly, understanding the impacts these may have is often unclear (Mahony 2018). This includes lemmatising algorithms and dictionaries, which do not yet exist for all languages; for particularly under-resourced languages, therefore, stemming may be the only, far from ideal, option.

#### *4.5.2 Corpus Preparation*

There are several libraries, for example, in Python or R, as well as off-the-shelf tools (e.g., MALLET), that implement LDA for topic modelling. Some allow for more sophisticated parameters than others, but generally speaking, they all follow the same principles that I have already discussed: a topic modelling algorithm models a number of documents to find correlations, essentially combining term-frequency and word-collocation operations. In order to model topics from unstructured text, the material first needs to be converted into a structured model that allows the algorithm to perform such calculations, for example, through a method called bag of words (BoW). What BoW does is first transform the words in the documents into numbers, i.e., into ids; the resulting mapping is typically called a 'dictionary'. It then builds a matrix based on the frequency of the words in the documents.
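
A minimal sketch of these two steps with Gensim, using toy tokenised documents invented for illustration: `Dictionary` builds the id mapping, and `doc2bow` turns each document into the (id, frequency) pairs that constitute the rows of the frequency matrix.

```python
from gensim.corpora import Dictionary

# Toy tokenised documents standing in for, e.g., newspaper front pages.
docs = [
    ["emigrazione", "lavoro", "america", "lavoro"],
    ["guerra", "patria", "america"],
]

# The 'dictionary': each unique token is assigned an integer id.
dictionary = Dictionary(docs)
print(dictionary.token2id)

# doc2bow renders each document as (token id, frequency) pairs,
# i.e., a sparse row of the document-term matrix.
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
print(bow_corpus)  # e.g. 'lavoro' appears with frequency 2 in the first row
```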

The generation of a BoW provides a notable example of the decisive influence of the analyst on algorithmic processes and therefore, ultimately, on the output. Specifically, in order to prepare the dictionary, i.e., the unique id assignment, the analyst has several so-called optimising operations at their disposal. For example, one might decide to filter out 'extremes', terms in the collection that are particularly frequent or infrequent; this operation may be performed in order to obtain what is believed to be a more representative core vocabulary. There are several ways to perform this task; for instance, the Python library Gensim (Řehůřek and Sojka 2010) has a built-in function called `filter_extremes` which filters out tokens in the dictionary based on their frequency of occurrence. The parameters are defined by the user, who can decide—though one might argue somewhat arbitrarily—to keep tokens which are contained in a defined number of documents (i.e., in no more than X documents and in no fewer than X documents) or to keep only the X most frequent tokens.
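
A sketch of such filtering, again assuming Gensim and toy documents; the thresholds are illustrative, and choosing them is precisely the kind of consequential, more or less arbitrary decision discussed above.

```python
from gensim.corpora import Dictionary

docs = [
    ["emigrazione", "lavoro", "america", "lavoro"],
    ["guerra", "patria", "america"],
    ["lavoro", "famiglia", "pane"],
]
dictionary = Dictionary(docs)

# Keep tokens occurring in at least 2 documents and in no more than
# 90% of them, then cap the vocabulary at the 10,000 most frequent
# survivors. On this toy corpus, only 'america' and 'lavoro' survive;
# words appearing in a single document are dropped.
dictionary.filter_extremes(no_below=2, no_above=0.9, keep_n=10_000)
print(dictionary.token2id)
```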

Another very common technique, originated in the field of IR and believed to contribute towards obtaining better topic modelling results, is the term frequency–inverse document frequency (TF-IDF) method. The method scores the 'importance' of a word, also known as its *weight*, according to its relative frequency, i.e., the frequency of occurrence of that word with respect to the number of documents in the collection in which it appears. In this way, the weight of words that are 'expected' to appear more frequently (generally speaking, non-salient words such as prepositions and articles, though this is also specific to the material) is resized accordingly. These preparatory operations are believed to help optimise a corpus for IR tasks (not just topic modelling) and in most cases, they may succeed. The assumption is, however, that a word is as important as its relative frequency, which may be true most times, but not always. Indeed, the possibility of capturing words that are very rare or that appear in very few documents may be just as valuable, in that such words may indicate a sudden shift in the used vocabulary, which may in turn signal a linguistic change or perhaps even a conceptual one. Furthermore, and perhaps even more significantly, these techniques only consider the formal frequency of a word, meaning that they do not cater for how that word is used. In the words of David Blei (Blei 2012, 82), one of the creators of topic modelling: 'One assumption that LDA makes is the "bag of words" assumption, that the order of the words in the document does not matter'. This approach, defined as 'unrealistic' by Blei himself, may work well for grammatical articles, prepositions or particularly recurrent OCR errors, but as no semantic detection is formally conducted, the frequency of a word, misleadingly referred to as its weight, becomes the sole determining factor in assessing whether a word is worth keeping or not. What is important to remember is that what is worth keeping for an algorithm may not reflect at all the writer's original intention. Languages may be probabilistic systems, but since words do not have a one-to-one relationship with meaning, they are fundamentally ambiguous, preferential systems. For this reason, researchers and practitioners should assess carefully whether using relative frequency methods is the best option when preparing the corpus to train the topic models. For example, research has shown that statistically more accurate models do not necessarily lead to a higher interpretability of the results (Jacobi et al. 2015).
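
A minimal TF-IDF sketch with Gensim's `TfidfModel` on toy documents: a word occurring in every document ('il') is weighted down to zero and disappears from the output, while rarer, document-specific words gain weight.

```python
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

docs = [
    ["il", "lavoro", "in", "america"],
    ["il", "viaggio", "in", "nave"],
    ["il", "lavoro", "e", "la", "famiglia"],
]
dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

tfidf = TfidfModel(bow)  # weights = term frequency x inverse document frequency
for doc in tfidf[bow]:
    print([(dictionary[i], round(w, 2)) for i, w in doc])

# 'il' appears in every document, so its inverse document frequency is 0
# and it vanishes; words unique to one document receive the highest weights.
```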

In an attempt to retain the meaning of words, a method that aims to compensate for this shortcoming is preparing the corpus as a dictionary of n-grams, typically bi-grams or tri-grams. These are pairs or triples of words that are statistically more likely to occur together than they would be if they occurred independently of each other. Several studies (see for instance, Wallach 2006; Wang et al. 2007; Kherwa and Bansal 2020) have indeed reported that using bi-grams to prepare the corpus may increase topics' interpretability as well as the efficacy of statistical methods such as perplexity and coherence (*cfr.* Sect. 4.5.3), developed to help researchers and practitioners optimise topic modelling results. Unfortunately, preparing the corpus as a dictionary of n-grams is a lengthy and intense process which may indeed be costly and time-consuming, especially in the case of very large repositories. Furthermore, researchers working on historical material, which typically contains a high number of OCR errors, should consider the actual added value of using this technique. Studies on topic modelling which suggest novel IR techniques or improved corpus preparation methods, such as those discussed here, and which report an increase in the models' quality typically make use of digitally born data such as online film reviews, blogs, news websites' headlines or contemporary conference proceedings. Being digitally born, these data-sets are of very high quality, especially compared to digitised historical material. Indeed, the amount of OCR errors in historical collections inevitably skews the output, as each word containing an error will be interpreted by the algorithm as a new word, even if it differs by only one character. Although pre-processing steps are taken to improve the quality of the collection, many errors may remain. In most cases, these errors would not prevent a human from reading and understanding the text, but they will interfere with how a machine processes it. As LDA is a probabilistic method, regardless of the specific variations in the chosen pre-processing and corpus preparation techniques, the results will be heavily reliant on the data quality.
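
A bi-gram sketch with Gensim's `Phrases` on invented toy documents; `min_count` and `threshold` are tuned here purely to make the toy example fire and would need corpus-specific calibration in practice.

```python
from gensim.models.phrases import Phrases, Phraser

docs = [
    ["new", "york", "harbour"],
    ["new", "york", "streets"],
    ["arrivals", "in", "new", "york"],
]

# Pairs that co-occur more often than chance would predict are merged
# into a single token.
bigram = Phraser(Phrases(docs, min_count=1, threshold=0.5))
print([bigram[d] for d in docs])  # 'new york' becomes the token 'new_york'
```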

Finally, it is worth remembering that, due to the intrinsically unstable and non-deterministic nature of topic modelling, assessing how and to what extent any of these corpus preparation techniques actually improves the quality of the models remains difficult. Users should indeed be aware that findings obtained with topic modelling can never be fully replicated or generalised, even if the same data-sets are used, the same steps are implemented and the same LDA settings are chosen from the same library/tool (Silva et al. 2021, 120). The post-authentic framework acknowledges such limitations and is wary of drawing conclusions based solely on topic modelling findings.

#### *4.5.3 Number of Topics*

The weaknesses and limitations, as well as the dangers, of overly trusting the capacity of topic modelling to find meaningful patterns have been openly acknowledged by several authors, including its very creators. As early as 2009, Chang et al. (2009), for example, compared the task of interpreting the topics, i.e., finding the semantic meaning of the discovered patterns, to the ritual of reading tea leaves. The authors wanted to warn users of the high risk of attributing meaning to patterns and trends that in reality may be 'spurious' in the mathematical sense, i.e., meaningless (Calude and Longo 2017) (*cfr.* Sect. 4.3). Naturally, the risk is even higher when the technique is adopted uncritically, especially in fields outside of computer science. The authors clarified that although it is typically implicitly assumed that the identified latent spaces will be semantically salient, in reality, this is not at all what the promise of topic modelling is about. Since then, others (see for instance Bail 2018) have also openly acknowledged the limitations of the technique and repeatedly attempted to reframe topic modelling as 'a tool for reading' rather than a tool for meaning, that is, an exploratory tool which, in order to obtain more nuanced and reliable findings, should be integrated with other methods. In this respect, for instance, sociologist Chris Bail (ibid.) notes:

Despite this rather humble assessment of the promise of topic models, many people continue to employ them as if they do in fact reveal the true meaning of texts, which I fear may create a surge in "false positive" findings in studies that employ topic models.

The application of the post-authentic framework to topic modelling helps reframe the technique as a statistical tool and resizes the user's expectations accordingly. Topic modelling posits a set of multinomial distributions over words—misleadingly called *topics*—as being present in each document in various proportions; it provides fairly accurate models of documents based on their words' distribution as grouped into clusters. This is valuable for obtaining a representation of a corpus through its words' distribution and/or for predicting a model of unseen text, but the commonly shared belief that these identified word clusters will also be semantically meaningful, i.e., that they will be topics in the human sense, remains only anecdotal (Chang et al. 2009).

The high risk of finding patterns that are in reality meaningless can be exemplified by the challenge of finding the so-called 'optimal' number of topics. This task requires the user's input to instruct the algorithm about how many words' distributions it has to search for in the corpus, which of course cannot be known in advance. Depending on individual cases, sometimes researchers and practitioners may know the collection extensively enough to feel confident about what this number might be; others prefer building multiple models with different numbers of topics and subsequently comparing the various compositions of the topics (Viola and Verheul 2019b). If on the one hand this approach allows the researcher to closely examine the varied topics' structures before deciding on the most coherent model, on the other it may lead analysts to prefer a model that seems to confirm their a priori ideas, thus resulting in biased interpretations. This approach may work fairly well in those cases when the analyst has extensive knowledge of the material, the field and the period of reference of the collection, among others, but it is generally not recommended in statistics; in the words of statistician Stephen M. Stigler: 'Beware of the problem of testing too many hypotheses; the more you torture the data, the more likely they are to confess, but confessions obtained under duress may not be admissible in the court of scientific opinion' (Stigler 1987).

More often, however, very little is known about the actual content of the documents, as *true* content is exactly what the technique is wrongly believed to be able to find, which provides the original justifying argument for using the method. It goes like this: due to the increasingly large size of available digital material, it is not possible for researchers and practitioners to explore the documents through traditional close reading methods; not only would this be too time-consuming but also somewhat less efficient, as a machine will always outperform humans in identifying patterns. Although this is in principle true, as clarified earlier, the assumption that all the found patterns are intrinsically meaningful is not. To meet this challenge, research has been conducted towards implementing statistical methods that could help researchers and practitioners find the craved 'optimal number of topics'. Two of the most common methods are model perplexity and topic coherence, measures that score the statistical quality of different topic models based on the topics' compositions in several models. The assumption behind these techniques, though not unanimously shared, is that a higher statistical quality yields more interpretable topics. Model perplexity (also known as predictive likelihood) predicts the likelihood of new (i.e., unseen) text based on a pre-trained model. The lower the perplexity value, the better the model predicts the distribution of the words that appear in each topic. However, studies have shown that optimising a topic model for perplexity does not necessarily increase the topics' interpretability, as perplexity and human judgement are often not correlated, and sometimes even slightly anti-correlated (Jacobi et al. 2015, 7).
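
A perplexity sketch with Gensim on toy documents; for brevity, the models are evaluated on their own training corpus, which a real workflow should avoid. Gensim's `log_perplexity` returns a per-word likelihood bound, from which perplexity is derived as 2^(−bound).

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    ["vote", "election", "senate"],
    ["wheat", "farm", "rain"],
    ["campaign", "vote", "election"],
    ["rain", "harvest", "farm"],
]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

for k in (2, 3, 4):
    lda = LdaModel(corpus, id2word=dictionary, num_topics=k,
                   random_state=1, passes=10)
    bound = lda.log_perplexity(corpus)  # per-word likelihood bound
    print(k, 2 ** (-bound))             # lower perplexity = better statistical fit
```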

Topic coherence was developed to compensate for this shortcoming and it has become popular over the years. What the method is designed to do is model human judgement by scoring the composition of the topics based on how *coherent*, i.e., interpretable, they are (Röder et al. 2015). If the coherence score increases as the number of topics increases, for example, that would suggest that the most interpretable model is the one that displays the highest coherence value before flattening out or dropping. Both techniques are widely used to determine the optimal number of topics; the truth is, however, that neither of these measures is ideal, because what they actually score is the probability of observations and not their degree of semantic meaning (Chang et al. 2009). In their study of topics' interpretability, Chang et al. (2009) noted that these traditional metrics do not in fact capture whether topics are interpretable or not, as they optimise topic models for likelihood-based measures although, as clarified earlier (*cfr.* Sect. 4.5), 'LDA does not look for the same patterns that people do' (Hindle et al. 2015, 510). The authors therefore suggest that practitioners adopt a more critical assessment of the topics' quality.
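
A companion sketch computing topic coherence with Gensim's `CoherenceModel`, under the same toy-corpus assumptions as above. A common heuristic is to train models over a range of topic numbers and look for where the coherence score peaks or flattens, rather than trusting any single value blindly.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

docs = [
    ["vote", "election", "senate"],
    ["wheat", "farm", "rain"],
    ["campaign", "vote", "election"],
    ["rain", "harvest", "farm"],
]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# Score models trained with different numbers of topics.
for k in (2, 3, 4):
    lda = LdaModel(corpus, id2word=dictionary, num_topics=k,
                   random_state=1, passes=10)
    cm = CoherenceModel(model=lda, texts=docs,
                        dictionary=dictionary, coherence="c_v")
    print(k, cm.get_coherence())  # higher means 'more coherent', not 'meaningful'
```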

In this chapter, I have discussed how the use of familiar notions to name computational techniques such as topic modelling, sentiment analysis and machine learning has increased their popularity while creating epistemological expectations that these methods will disappoint. Especially when such techniques are used outside of their field of origin, the generated confusion contributes to obfuscating the mathematical assumptions upon which they are built, such as the fundamental difference between discrete vs continuous modelling of information and the consequences that stem from it. In the context of digital knowledge creation and in relation to the big data philosophy, I reflected on the significant, yet often overlooked, implications for notions of causality and correlations. I then applied these considerations to describe the third use case of the book, analysis of a digital object, and used the properties and assumptions of topic modelling as the case example of a widely used computational technique that treats a collection of texts as discrete data. I have shown how the post-authentic framework can be used as the applied theory to engage critically with topic modelling by devoting special attention to the aspects of the analysis that are key to maintaining a symbiotic connection with the sources: pre-processing, corpus preparation and the number of topics. Specifically, I have shown how the application of the post-authentic framework to topic modelling acknowledges the technique as correct at its core but problematic, and therefore in need of critical engagement.

My intention is not to dismiss topic modelling as woefully inadequate, but rather to encourage the integration of the method with critical scrutiny in order to address its limitations. In so doing, I have argued that by introducing a counter-narrative into the main scientistic discourse, the post-authentic framework strains the current system and can help us refigure a novel and more honest model for knowledge production in the digital. For example, when topic modelling is used for humanistic enquiry such as the analysis of cultural heritage material, as discussed here, the post-authentic framework serves as a warning that the technique's limitations are particularly significant and that their impact on the provided interpretation of the past is problematic. I will return to these points in the next chapter, in which I discuss the fourth and last use case of the book, visualisation of a digital object. Specifically, I will show how I have applied the post-authentic framework to prototyping a UI for topic modelling. I will insist on key aspects that aim to promote the active and reflective participation of the researcher in the process of digital knowledge production; I will devote particular attention to the added value of building UI elements that contribute to the urgent need for the establishment of critical data and visualisation literacy, especially when computational methods are adopted in fields outside of their original design.


# What the Graph

Figures don't lie, but liars do figure. (Attributed to Carroll D. Wright, 1889)

# 5.1 POKER (INTER)FACES

Data visualisation and information visualisation are commonly used as synonyms but it has been argued that they in fact mean different things (Spence 2014; Falkowitz 2019; Ware 2021). The main difference would lie in the basic distinction between *data* and *information* in computer science: data is understood as raw material (e.g., numbers), that is, the input, and believed not to carry any specific meaning *per se*, whereas information is the output, i.e., the meaning carried by a set of data. Thus, following this definition, information visualisation is understood as a cognitive activity (Spence 2014, 2), the process of discovering the meaning associated with a set of data, whereas data visualisation is the process of exploring data that may or may not uncover meaning, i.e., result in information visualisation. Another way to look at it is to consider the purpose of these two activities: data visualisation would essentially be a heuristic activity, whereas the main goal of information visualisation would be to influence a decision-making process (Falkowitz 2019). The two types of visualisations would accordingly translate into distinct products: data visualisations would allow several levels of interaction (e.g., filtering, zooming, selecting, aggregating), whereas information visualisations would simply show one or a limited number of viewpoints while obscuring other perspectives more or less deliberately. Thus, according to this logic, only data that function as cognitive tools become information and therefore not all data is information.

The post-authentic framework that I advance in this book argues against binary conceptualisations that misleadingly suggest and continue to perpetuate the artificial notion of 'raw data', as if data could naturally pre-exist in a pristine, untouched environment, as if all the steps preceding the visualisation, for example selection, collection, compilation, categorisation and storage, were not already acts of interpretation and creation (Manovich 2002; Gitelman 2013; Drucker 2020). The post-authentic framework therefore transcends the distinction between data and information and between data visualisation and information visualisation; it acknowledges that data always embeds the interpretative dimensions that originated it. It also recognises that not just the processes of data creation but also the very tools and methods adopted for creating data are equally situated, limited and partial. Actions, tools, algorithms, platforms, infrastructures and methods are never neutral because they themselves stem from systems that are in turn situated and therefore already interpreted. Hence, whether the intent is to explore data or to persuade through data, the post-authentic framework to visualisation advocates transparency in the way the data is created and conclusions are drawn. In light of the considerations reasoned in the previous chapters, I will therefore use these terms interchangeably to signal that we need to move beyond the distinction between data and information and consequently between data visualisation and information visualisation, because data is always produced to various degrees.

Historically, innovations in data visualisation have originated from concrete, often practical goals (Friendly 2008, 30), so it is no surprise that the explosion of data over the last two decades and the subsequent need to analyse and interpret it, paired with advances in technology and statistical theory, have greatly impacted the field. Indeed, as it is praised for its capacity to promptly display emerging properties in the data as well as to enhance access, visualisation has increasingly become an integral part of the digital. For example, using information visualisation to better understand the complex, internal processes according to which ML models elaborate data and provide results has been shown to offer insights that may lead to more transparency and increased trustworthiness in ML outputs, and it has therefore become very popular in recent years (Chatzimparmpas et al. 2020).

Visualisation has also gained a significant role in the context of analytical methods, including topic modelling. Studies have argued that graphic display tools are valuable not only for understanding the models' results but, because similarity measures and human interpretation are partially misaligned (*cfr.* Chap. 4), also for a general assessment of whether topic modelling is at all a suitable technique for AI and cognitive modelling applications (Murdock and Allen 2015, 4284). Several visualisation solutions have therefore been proposed over the years to address some of the already discussed challenges around topic modelling. These can be roughly divided into two research directions: the use of visualisation to improve the interpretation of the results and, stemming from the first one, the use of visualisation to improve the results themselves. Solutions in the first category try to enhance topics' interpretability by visualising the results in a variety of ways using different statistical measures. *Termite* (Chuang et al. 2012), for example, allows terms' comparison within and across topics using saliency measures based on the concept of weight (*cfr.* Chap. 4), but it does not allow for document interactivity. Chaney and Blei (2012) propose a web-based interface to allow non-technical users to navigate the output of a topic model, but it is not possible to draw comparisons of the topics' distribution across documents. *TopicNets* (Gretarsson et al. 2012) visualises the relations between a set of documents (or parts of documents) and their discovered topics in the form of an interactive network-type graph (i.e., nodes and edges), but it does not show topic or document composition. *LDAvis* (Sievert and Shirley 2014) visualises terms within a topic according to weighted topic-word and topic-topic relationships, but the connection with the documents is lost. Finally, *Topic Explorer* (Murdock and Allen 2015) builds on *LDAvis* by visualising topic-document and document-document relationships as well as topic distribution and document composition.
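As a point of reference for this first category, the sketch below renders a trained model with pyLDAvis, the Python port of *LDAvis* (Sievert and Shirley 2014); the toy corpus and the output filename are illustrative assumptions.

```python
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Minimal toy corpus; in practice, the tokenised collection would be used.
texts = [["migration", "newspaper", "community"],
         ["newspaper", "issue", "title"],
         ["migration", "community", "identity"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=1)

# Interactive topic-term view in the browser: term weights within a topic and
# inter-topic distances; as noted above, the documents themselves are lost.
vis = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_vis.html")
```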

Studies in the second category allow users to interact with the models through a variety of human-in-the-loop<sup>1</sup> (HINTL) methods. For example, *iVisClustering* (Lee et al. 2012) allows users to manually create or remove topics, merge or split topics and reassign documents to another topic while visualising topic-document associations in a scatter plot. Using *ITM*—Interactive Topic Modelling (Hu et al. 2014)—users can add, emphasise or ignore words within topics, whereas with *UTOPIAN* (Choo et al. 2013), users can adjust the weights of words within topics, merge and split topics and generate new topics. Hoque and Carenini (2016) and Cai and Sun (2018) also propose visual methods to curate the topics by adding or removing terms within a topic, adjusting the weights, merging similar topics or splitting mixed ones, manually validating the results and finally generating new topics on the fly.

As this brief literature review shows, one of the main challenges of topic modelling, namely interpreting the results, has so far been tackled from a problem-solving point of view for which the main task is essentially to best exploit the model's identified document structure (Blei 2012). Significantly, what all these studies have in common is the implementation of visualisation techniques exclusively in the final stage of a topic modelling workflow, that is, either to interpret the algorithm's output or when training the algorithm itself. What these visualisation interfaces clearly show is the persistent conceptual disconnection between the results and the processes that generated them: the common belief that only interventions on the algorithm or on the final output are worthy of study and examination, so that interventions on the sources-data are dismissed as not immediately relevant. As I have argued in Chap. 3, these processes of manipulation are often seen as 'standard', unproblematic and inconsequential rather than as heavy interventions on the sources and therefore on the results. The post-authentic framework that I propose in this book, on the contrary, strives to preserve and maintain the connection between the analyst and the digital object, and it opposes any naïve conceptualisation of digital objects as finished, fixed, unproblematic entities. The post-authentic framework ultimately sees the human-digital object relationship as an essential component of the process of knowledge production in the digital. When applied to UI, the post-authentic framework is therefore not only mindful of such a connection but in fact encourages the scholar to be critically aware of it. My efforts towards building a post-authentic interface for topic modelling that I present here are therefore guided by this intention to enable users to actively engage with their digital sources and take ownership of their interventions, but also to self-reflect on and critique those interventions, thus openly acknowledging the interpretative dimension of the digital research process.

The endless flow of digitised material and the need to store, access and analyse it have impacted the role of visualisation also in those fields that traditionally relied on material sources, for instance, cultural heritage, history, linguistics and more widely the humanities. With specific reference to cultural heritage, for instance, institutions have over the years resorted more and more to visual means—typically web-based interfaces—as a way to enhance access to cultural collections for users' appreciation as well as for research purposes (Windhager et al. 2019a). In a survey of information visualisation approaches to digital cultural heritage collections from 2014 to 2017, for example, Windhager et al. (ibid.) found that visualisations of digital cultural heritage material had steadily increased, peaking in 2015. At the same time, however, these authors also highlighted that the seventy visualisation systems, prototypes and platforms they surveyed shared 'overly narrow task- and deficiency-driven approaches to interface design that are grounded in a simplistic user-as-consumer- and problem solver-model' (ibid., 13). Drucker (2013; 2014; 2020) has also long argued that graphical displays in the humanities often exhibit a function- and task-driven UI design and generally lack a critical stance towards visualisation, evidencing IR intentions rather than the elicitation of curiosity, thoughtful engagement and reflection.

The post-authentic framework that I advance in this book aims to contribute to the urgent need for the establishment of critical data literacy, including visualisation literacy. It conceptualises digital objects as unfinished, situated processes, and it acknowledges the limitations, biases and incompleteness of tools and methods adopted for the analysis and visual representation of digital content. It provides helpful concepts for a re-theorisation of the process of digital knowledge creation, including the implementation of re-devised practices which are also acknowledged as always being adapted, unfixed, unfinished, arranged and interpreted. Applied to visualisations and interfaces, it acknowledges them as problematic endeavours that embed a wide net of situated processes, and it caters for their novel conceptualisation as epistemic objects which themselves carry meanings and therefore bear consequences.

Post-authentic graphical displays counter what I call *poker interfaces*: attractive visualisations and sleek interfaces that tend to present information as detached from any subjectivity or which obscure or even break the connection with the digital object and the multiple layers of manipulation. In this chapter, I discuss two examples of how the post-authentic framework can be applied to visualisations; in Sect. 5.2, I examine prototypical work for designing a topic modelling interface, whereas in Sect. 5.3, I present the design choices we took whilst developing DeXTER, the interactive visualisation app to explore enriched cultural heritage material currently loaded with *ChroniclItaly 3.0* (*cfr.* Sect. 2.4). My discussion will specifically revolve around the challenges of promoting symbiotic exchanges when engaging with software, especially focusing on the efforts we took to expose—rather than hide—the ambiguities and uncertainties of NA and SA. I end the chapter by acknowledging digital visualisation as fundamentally a curatorial operation which requires countless subjective decisions that intervene on the digital object with several layers of manipulation; the post-authentic framework to graphical display, I conclude, can guide the encoding of such processes in the visualisation.

# 5.2 VISUALISATION OF DIGITAL OBJECTS: TOWARDS A POST-AUTHENTIC USER INTERFACE FOR TOPIC MODELLING

The development of a post-authentic interface for topic modelling should be understood in the context of the wider project Digital History Advanced Research Projects Accelerator (DHARPA),<sup>2</sup> within which software for DH research is currently being developed. Originally conceived by Sean Takats, the DHARPA project today is a team of developers and academics who continuously contribute to each other's expertise by sharing knowledge and practices from a range of disciplines (computer programming, data engineering, data visualisation, linguistics, geography and various strains of history) (Cunningham et al. 2022). Like DeXTER, DHARPA is hosted at the C2DH (*cfr.* Chap. 2). At the heart of DHARPA is *encoding criticism*, the effort of advocating the active and reflexive participation of the scholar in the process of digital knowledge production (Viola et al. 2021). Digital tools and techniques have been harshly criticised for alienating humanities scholars from their sources (ibid.) (*cfr.* Chap. 1), a bond regarded as crucial for the pursuit of scholarly enquiry; the driving rationale of DHARPA is that through critical assessment, contextualisation and documentation of digital methodologies—which are understood as partial and situated—such a relationship can on the contrary be fortified and expanded. With this aim, DHARPA is developing software that operationalises critical epistemology by placing the scholar-source relationship at its centre. The efforts towards building a post-authentic interface for topic modelling that I present here are therefore guided by the very same intention to enable users to actively engage with the digital object and take ownership of their interventions. Moreover, through the post-authentic lens, my aim is to openly acknowledge the interpretative dimension of the digital research process and thus to embed self-reflection and critique into both the software's back-end and front-end. The confluence of the post-authentic framework, DeXTER, DHARPA and the C2DH is a perfect example of how the notions of symbiosis and mutualism can guide the process of knowledge creation in the digital.

The post-authentic framework opposes any conceptualisation of digital objects as something disconnected from the material sources; when applied to UI, it is therefore oriented towards safeguarding such a connection and encouraging the scholar to be critically aware of it. The example of the NLP software MALLET (McCallum 2002) illustrates a case in which this connection is obscured. MALLET is a widely used ML tool for a range of NLP tasks such as document classification, clustering, topic modelling, information extraction and others. During the steps of data preparation for topic modelling (*cfr.* Sect. 4.5), for example, the analyst is never prompted to view the results of their interventions and, overall, there is little chance of interacting with the digital object. This does not intrinsically mean that any topic modelling analysis based on MALLET is to be discarded, but it does mean that a distance is imposed between the sources and the analyst. I argue that it is this distance that inevitably causes disconnection and increases the risk of attributing meaning to spurious patterns (*cfr.* Sect. 4.5.3). Indeed, to ensure that the identified patterns carry actual significance, considerable efforts need to be subsequently directed towards regaining this connection, sometimes in the form of novel analytical methodologies such as the discourse-driven topic modelling approach (DDTM) we developed within OcEx (*cfr.* Sect. 2.4) (Viola and Verheul 2019b). This approach integrates topic modelling with the discourse-historical approach (DHA) (Reisigl and Wodak 2001), an applied method of critical discourse analysis theory (van Dijk 1993) which triangulates linguistic, social and historical data to understand language use in its full socio-historical context and as a reflection of its cultural values and political ideologies (Viola and Verheul 2019b). The integration of DHA into topic modelling is particularly useful for tasks such as topic interpretation and labelling, thus reducing the risk of attributing meaning to spurious patterns.

Applied to interface design, the post-authentic framework strives to avoid the human-digital object disconnection by prompting critical engagement with the specificity of the source. Taking once again the example of *ChroniclItaly 3.0*, the post-authentic framework devotes careful attention to never losing contact with the information embedded in the filenames themselves. Based on the Library of Congress cataloguing schema, the filenames carry valuable metadata including the reference code of the newspapers' titles, the page number and the publication date of each issue (Viola and Fiscarelli 2021a). The reason why it is so very important to critically engage with this information is once more due to the specificity of the source. Immigrant newspapers were constantly on the verge of bankruptcy, which caused titles to be often discontinued; for the same reason, some newspapers could afford to publish biweekly or even daily issues, while others could only publish intermittently (Viola and Verheul 2019a,b). This is naturally reflected in the composition of the collection; newspapers like *L'Italia*—one of the most mainstream Italian immigrant publications in the United States at the time—and *Cronaca Sovversiva*—the most important anarchic Italian American newspaper—managed to publish continuously for years, whilst others like *La Rassegna* or *La Sentinella del West*, which came into being as small, personal projects of their founders, could only survive for a few months. Although, across the entire period of coverage, the collection on the whole holds a fair balance between the number of issues, the type of newspaper, the geographical location, the time span and the political orientation of each title, the exploration of the collection's metadata highlights factors such as the over- or under-representation of some titles, either overall or at specific points in time. Figure 5.1 displays how the issues are diversely distributed throughout the collection.
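As a purely illustrative sketch of this kind of engagement, the snippet below parses metadata out of filenames and counts issues per title in three-month bins, as in Fig. 5.1; the filename pattern is a hypothetical stand-in loosely modelled on Library of Congress reference codes, not the collection's actual schema.

```python
import re
from collections import Counter

# Hypothetical pattern: <title reference code>_<publication date>_<page>.txt
FILENAME = re.compile(r"(?P<title>sn\d+)_(?P<date>\d{4}-\d{2}-\d{2})_(?P<page>\d+)\.txt")

filenames = [
    "sn84037024_1905-05-01_1.txt",
    "sn84037024_1905-05-08_1.txt",
    "sn85066408_1905-06-02_1.txt",
]

issues = Counter()
for name in filenames:
    m = FILENAME.match(name)
    if m:
        year, month, _ = m.group("date").split("-")
        quarter = (int(month) - 1) // 3 + 1  # three-month bins, as in Fig. 5.1
        issues[(m.group("title"), int(year), quarter)] += 1

# One count per (title, year, quarter): the raw material for a distribution plot.
for key, n in sorted(issues.items()):
    print(key, n)
```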

The application of the post-authentic framework to digital objects recognises that factors like the heterogeneity of the digital object may result in potential polarisation of topics and points of view; it therefore maintains a connection with the digital object by facilitating access to such information and allowing the researcher to engage critically with it. By embedding the option to explore the metadata information (if present), the post-authentic framework signals the acknowledgement of the continuous underlying structure of a digital object (*cfr.* Sect. 4.2) hidden by its digital transformation into discrete form, i.e., sequences of 0s and 1s. It is indeed this acknowledgement that allows the analyst to obtain a fuller understanding of the object itself, in turn facilitating fundamental tasks such as adjusting the research question, resizing expectations and making sense of the results.

This sustained connection with the materiality of the source has immediate relevance for computational techniques such as topic modelling. As discussed in Sect. 4.4, the LDA algorithm assumes that a fixed number of topics is represented in different proportions in *all* the documents; this is clearly a rather artificial and unrealistic assumption, as it is highly unlikely that one fixed—and to some extent arbitrary—number of topics could adequately represent the content of all the ingested documents.
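In standard LDA notation (a conventional formulation, not the author's own), this assumption is visible in the fact that the same fixed number of topics *K* appears in the word-generating mixture of every document *d*:

```latex
% Every document d draws its words from the same K topics;
% only the mixture proportions \theta_d vary from document to document.
p(w \mid d) \;=\; \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)
            \;=\; \sum_{k=1}^{K} \phi_{k,w}\, \theta_{d,k}
```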

**Fig. 5.1** Distribution of issues within *ChroniclItaly 3.0* per title. Red lines indicate at least one issue in a three-month period. Figure taken from Viola and Fiscarelli (2021b)

Allowing the analyst to know that the material for the digital analysis is unevenly distributed acts as a way to highlight the problematic aspects of digital research and digital objects that precede the analysis itself but which nevertheless influence how the technique may be applied and how the results may be interpreted. Figure 5.2 shows how this step could be handled in the interface. Once the documents are uploaded, the analyst is prompted with a question asking them about the potential presence of metadata. With this question, the intention is to maintain contact with the continuous aspect of the digital object hidden by its discrete representation and further altered by the topic modelling algorithm, which treats the documents, too, as a collection of discrete data.

If the analyst chooses 'yes', the metadata would then be used to create a dynamic, interactive visualisation inspired by the one displayed in Fig. 5.1; this would display how the files are distributed in the collection, ultimately creating room for reflection and awareness. In the case of *ChroniclItaly 3.0*, for example, this visualisation displays the number of issues published on a specific day, month or year and by which titles; the display of this information allows the analyst to promptly identify differences in the frequency of publication across titles and potential gaps in the collection (Fig. 5.3). The post-authentic framework to visualisation signals the importance of maintaining the connection with the digital object, understood as an organic, problematic entity. Such a connection is acknowledged as an essential element of the process of knowledge creation in the digital in that it favours a more engaged, critical approach to digital objects and creates a space in which more informed decisions can be made, ultimately answering the need for digital data and visualisation literacy.

**Fig. 5.2** Wireframe of a post-authentic interface for topic modelling: sources upload. The wireframe displays how the post-authentic framework to metadata information could guide the development of an interface. Wireframe by the author and Mariella de Crouy Chanel

**Fig. 5.3** Post-authentic framework to sources metadata information display. Interactive visualisation available at https://observablehq.com/@dharpa-project/timestamped-corpus. Visualisation by the author and Mariella de Crouy Chanel

The post-authentic framework to interface design aims to make explicit, at each stage of the digital knowledge creation process, the link between the analyst, the digital object's discretised continuous information and the methods employed to manage, analyse and visualise it. Informed by these motivations, an interface for topic modelling would facilitate close engagement, for instance by allowing users to create and preview subsets of the digital object (e.g., through filtering, *cfr.* Sect. 4.5.2) for further exploration or to test hypotheses on a sample. In this way, the post-authentic framework signals the rejection of objectivist and positivist understandings of digital processes which depict data as pre-existing and somewhat fixed. The interface, on the contrary, would adopt a constructivist principle which exposes the management of data as a problematic enterprise, a subjective act made of constant interpretation, manipulation and decisions which transform, select, aggregate and ultimately *create* data (Drucker 2011). Following these principles, the wireframe in Fig. 5.4 displays how sources' preview could be handled in the interface.
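A minimal sketch of such previewable, non-destructive subsetting, assuming a toy corpus table whose column names (title, date, text) are illustrative:

```python
import pandas as pd

# Toy stand-in for a corpus: one row per digitised issue.
corpus = pd.DataFrame({
    "title": ["L'Italia", "L'Italia", "Cronaca Sovversiva", "La Rassegna"],
    "date": pd.to_datetime(["1905-05-01", "1906-01-15", "1905-06-02", "1905-07-10"]),
    "text": ["...", "...", "...", "..."],
})

# Filtering as a visible, reversible intervention: the subset is previewed
# and kept separate rather than silently overwriting the sources.
subset = corpus[(corpus["title"] == "L'Italia") & (corpus["date"].dt.year == 1905)]
print(subset)
```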

Research that adopts computational techniques rarely acknowledges the influential role of tools, infrastructures, software, categories, models and algorithms on the research process or the results, as these are typically reputed to be neutral. The researcher or curator often provides little or no documentation of the decisions and the mechanisms that transformed their sources into data (Viola and Fiscarelli 2021b). Through the chapters of this book, however, I have demonstrated that transformative operations such as those directed at the creation, enrichment, digital analysis and visualisation of a digital object involve an intricate network of complex interactions between countless elements and factors, including the materiality of the sources, the digital object and the analyst, as well as between the operations themselves. Although often presented as more or less 'standard', these operations on the contrary need to be problematised and tackled critically. The post-authentic framework to knowledge creation in the digital acknowledges them as limited and situated, and it prompts a fundamental rethink of how these operations impact the sources and produce a digital object; this challenge, I maintain, can be met by maintaining engaged contact with the digital object. For problematic operations such as pre-processing, stemming and lemmatising (*cfr.* Sect. 4.5.2), this connection can be sustained by prompting engagement, for instance by making processes readily visible and intelligible to the analyst. The wireframes in Figs. 5.5 and 5.6 show how these operations would be handled in the interface. An expandable tool-tip asking 'What is pre-processing?' together with *i* buttons located next to each operation would give users the possibility to access detailed explanations of the available operations—often grouped under opaque labels such as 'data cleaning'—to better understand the assumptions behind them. The UI would also allow data preview, thus making the impact of each intervention visible and accessible to the analyst. These features would create room for more conscious decisions and, at the same time, they would signal that data is always *made*.
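The kind of before-and-after preview such an interface might surface can be sketched as follows; the example uses NLTK's Porter stemmer and WordNet lemmatiser purely as illustrative stand-ins for whatever operations the interface would expose.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lexicon required by the lemmatiser

stemmer, lemmatiser = PorterStemmer(), WordNetLemmatizer()
words = ["communities", "publishing", "studies", "was"]

# Side-by-side preview: the two operations embody different assumptions
# and visibly produce different data from the same sources.
for w in words:
    print(f"{w:12} stem={stemmer.stem(w):10} lemma={lemmatiser.lemmatize(w)}")

# Lemmatisation is itself parameterised: the part of speech changes the output.
print(lemmatiser.lemmatize("was"))           # 'was' (treated as a noun by default)
print(lemmatiser.lemmatize("was", pos="v"))  # 'be'
```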

The post-authentic framework calls upon the scholar's critical and active engagement in the process of knowledge creation in the digital and raises awareness of the limitations, biases and incompleteness of tools and methods; applied to interface design, it can therefore contribute to the establishment of critical data management and visualisation literacy.

**Fig. 5.4** Post-authentic interface for topic modelling: data preview. The wireframe displays how the post-authentic framework could guide the development of an interface for exploring the sources. Wireframe by the author and Mariella de Crouy Chanel

**Fig. 5.5** Interface for topic modelling: data pre-processing. The wireframe displays how the post-authentic framework to UI could make pre-processing more transparent to users. Wireframe by the author and Mariella de Crouy Chanel

In the interface, this would be achieved by entering into a dialogue with the researcher, for instance by asking the question 'What is corpus preparation?' (Fig. 5.7); the combination of expandable tool-tips and *i* buttons next to each operation would serve the dual purpose of making the process of data creation more intelligible to users while maintaining the connection with the digital object. Indeed, more transparent processes enable a more conscious participation of the scholar in the fluid exchanges between computational and human processes, which are understood as part of a wider, complex system of interactions. The post-authentic framework attempts to reach symbiosis and mutualism (*cfr.* Sect. 2.2) by making these exchanges explicit, as opposed to a passive and dissociated fruition of such interactions. To the same aim, the output resulting from implementing the different methods for corpus preparation would be saved each time (left panel in Fig. 5.7) so that users could experiment with various methods and settings, compare results and make more informed decisions. In this way, the interface would actualise a counterbalancing narrative within the main positivist discourse that equates the removal of the human—which is in any case illusory—with the removal of biases. On the contrary, the argument I advance in this book is that it is only through the active and conscious participation of the human in processes of data creation, tools' selection and methods' and algorithms' implementation that such biases can in fact be identified, acknowledged and, to an extent, addressed.

**Fig. 5.6** Interface for topic modelling: data pre-processing (stemming and lemmatising). The wireframe displays how the post-authentic framework to UI could make stemming and lemmatising more transparent to users. Wireframe by the author and Mariella de Crouy Chanel

The post-authentic framework to knowledge creation in the digital advocates a more participatory, critical approach towards digital methods and tools, particularly when they are applied for humanistic enquiry. Against a purely correlations-driven big data approach, it offers a more complex and nuanced perspective that challenges current views sidelining human agency and criticality in favour of patterns and correlations. Applied to methods such as topic modelling, for instance, the post-authentic framework highlights the assumptions behind the technique, such as discreteness, acausality, randomness and text disappearance. Whilst exploiting the new opportunities offered by computational technologies, it rejects a passive adoption of these methods, and it highlights the intrinsically dynamic, situated, interpreted and partial nature of the digital, in contrast with the main discourse that still presents techniques and outputs as exact, final, objective and true. Applied to UI, it also provides helpful concepts for both its theorisation and the implementation of re-devised visualisation practices, which are also acknowledged as being adapted, unfixed, unfinished, arranged and interpreted.

**Fig. 5.7** Interface for topic modelling: corpus preparation. The wireframe displays how the post-authentic framework to UI could make corpus preparation more transparent to users. Wireframe by the author and Mariella de Crouy Chanel

# 5.3 DEXTER: A POST-AUTHENTIC APPROACH TO NETWORK AND SENTIMENT VISUALISATION

In the context of visualisation, questions of criticality, transparency, trust and accountability have increasingly become part of the scientific discourse (see for instance Gaver et al. 2003; Drucker 2011, 2013, 2014, 2020; Glinka et al. 2015; Sánchez et al. 2019; Windhager et al. 2019a; Boyd Davis et al. 2021), and several recommendations for operationalising critical digital literacy in visual design have been suggested. For example, the interpretative and evaluative value of ambiguity for design has been praised by Gaver et al. (2003); Drucker (2020) has proposed a framework for visualisations that promotes plurality, critical engagement and data transparency; Windhager et al. (2019a) have suggested design guidelines that also promote contingency (i.e., acknowledging the incompleteness of user experience) and empowerment (i.e., encouraging users' self-activation and engagement) (141), and Sánchez et al. (2019) have offered a framework for managing uncertainty in DH visualisations. Despite this increased awareness, however, research in this area points out how intrinsic aspects of knowledge creation such as ambiguity, uncertainty and errors are still largely hidden from view and how instead the majority of graphical displays tend to be sleek visualisations that convey exactness, neutrality and assertiveness, i.e., poker interfaces.

The post-authentic framework that this book suggests incorporates all these recent perspectives; at the same time, however, as it refers to the realm of digital knowledge that is created daily, it goes beyond them. With specific reference to visualisations, the post-authentic framework endorses ambiguity, uncertainty and transparency; it acknowledges the incompleteness and partiality of data, tools and methods and, rather than muddying this, it exposes their potential untrustworthiness. It is thanks to this awareness, I maintain, that the post-authentic framework contributes to keeping the process of knowledge creation in the digital honest and accountable, both for present and future generations. The visualisations for NA and SA in the DeXTER app that I present here are a good example of how the post-authentic framework can actualise these aims when visualising a digital object.

The DeXTER project is a post-authentic research activity which combines the creation of an enrichment workflow with a meta-reflection on the workflow itself as well as the creation of an interactive app to visualise enriched digital heritage collections. This means that the main intention guiding its design is to provoke independent assessment (Gaver et al. 2003), to expose inconsistencies and cast doubts on the digital object and to create a space for interpretation, rather than to provide one. This includes openly acknowledging that the implementation and potential value of the methods used are also inextricably intertwined with the specificity of the source as well as with the research context of the related project. For example, when enriching *ChroniclItaly 3.0*, we used NA and SA to explore the several ways in which referential entities relate to each other in the collection; this included modelling their frequency of co-occurrence in a sentence and how this changes over time, the prevailing attitude towards such entities, and the connections between entities at specific points in time (e.g., on the same day) across the different newspapers. These operations aimed to maximise the potential value of using referential entities as indicators of markers of identity (*cfr.* Chap. 3), that is, as a way to navigate the process of Italian Transatlantic migration as it was narrated by the different communities of Italian immigrants in the United States. Far from being standard, techniques and methods are therefore understood as adapted and chosen, and their suitability as in need of assessment, rather than assumed to be intrinsically good (or bad).

The post-authentic framework can inform the selection of methods by warning the analyst that techniques developed in other fields for specific aims and with specific assumptions are not necessarily compatible across different data types. For example, NA is a method that originates in mathematics and graph theory (Biggs et al. 1986), and although it has long been applied across disciplines and for different purposes, it is typically used to answer questions mostly pertaining to the social sciences. This is because the underlying assumption is that the discrete modelling of how actors (e.g., entities) relate to each other (i.e., edges) provides adequate explanations of social phenomena. For a detailed overview of its application, particularly in modern sociology, I refer the reader to Korom (2015).
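As a minimal illustration of this discrete modelling, and of the sentence-level co-occurrence weighting later applied to *ChroniclItaly 3.0*, the sketch below builds a weighted graph with the networkx library; the entity lists are invented toy data.

```python
import itertools
import networkx as nx

# Toy data: each sentence reduced to the referential entities it mentions.
sentences = [
    ["sicilia", "new york", "palermo"],
    ["sicilia", "new york"],
    ["palermo", "roma"],
]

G = nx.Graph()
for entities in sentences:
    for a, b in itertools.combinations(sorted(set(entities)), 2):
        # Edge weight = how often the two entities co-occur in a sentence;
        # nodes are modelled as discrete, independent actors.
        previous = G.get_edge_data(a, b, default={"weight": 0})["weight"]
        G.add_edge(a, b, weight=previous + 1)

print(list(G.edges(data=True)))
```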

Due to its characteristic feature of schematically representing abstract and often ambiguous information, NA has recently become popular also in the humanities. In linguistics, for example, NA has been applied to large textual corpora of naturally occurring language to analyse the relationship between language and identity in multilingual communities (Lanza and Svendsen 2007) or to explore complex syntactic and lexical patterns as networks, for example in language acquisition or language development studies (Barceló-Coblijn et al. 2017). It has also been argued that NA could be integrated into sociolinguistics as a way to provide insights into the relationship between the use of linguistic forms and culture (Diehl 2019). In branches of DH such as digital history and digital cultural heritage, NA is also considered to be an efficient method to intuitively reduce complexity (Düring et al. 2015). This may be due to the fact that the technique benefits particularly from attractive visualisations which support the impression that explanations for social events are accurate, complete, detailed and scientific, naturally adding to the allure of using it.

However, a typically omitted, yet rather critical, issue of NA is that the graphs can only display the nodes and attributes that are modelled; as these stem from samples which are by definition incomplete and which undergo several layers of manipulation, transformation and selection, the conclusions the graphs suggest will always be partial and potentially based on over-represented actors or, conversely, on under-represented social categories. In the case of a digital object such as the cultural heritage collection *ChroniclItaly 3.0*, which aggregates heterogeneously distributed sources (*cfr.* Sect. 5.2), this issue is particularly significant as any resulting graph depends on the modelled newspaper (e.g., mainstream vs anarchic), on the type and number of entities included and excluded and on the attributes' variables (e.g., frequency of co-occurrence, number of relations, sentiment polarity), to name but a few. Each one of these factors can dramatically influence the network displays and consequently impact the provided interpretation of the past.

The project's GitHub repository<sup>3</sup>—which is to be understood as an integral part of the visualisation interface—is a good example of how the post-authentic framework can guide the actualisation of principles of transparency, accountability and reproducibility and of how it values ambiguity and uncertainty. DeXTER's GitHub repository documents, explains and motivates all the interventions on the data, including reporting on the processes of entity selection (*cfr.* Sect. 3.3). The aim is to warn the analyst that, despite being (too) often presented as a statement of fact, a visually displayed network is a mediated and heavily processed representation of the modelled actors. As such, the post-authentic framework does not solely aim to increase trust in the data and how it is transformed, but also to acknowledge uncertainty in both the data lifecycle and the resulting graphs and finally to expose and accept how these may be untrustworthy (Boyd Davis et al. 2021, 546). The act of making explicit the interpretative work of shaping the data is what Drucker calls 'exposing the enunciative workings' (2020, 149):

For data production, the task is to expose some of the procedures and steps by which data is created, selected, cleaned, and processed. Retracing the statistical processes, showing the data model and what has been eliminated, averaged, reduced, and changed in the course of the lifecycle would put the values of the data into a relative, rather than declarative, mode. This is one of the points of connection with the interface system and task of exposing the enunciative workings.

By acknowledging that the displayed entities are not *all* the entities in the collection but in fact a representative, yet small, selection, DeXTER encourages close engagement with the NA graphs; it does not try to remove uncertainty but points to where it is. At the same time, it recognises the management of data as an act of constant creation, rather than a mere observation of neutral phenomena. For example, the process of entity selection as I described it in Sect. 3.4 created a subset of the most frequently occurring entities distributed proportionately across the different newspapers. With this intervention, we aimed to alleviate the issue of source over-representation due to some titles being much larger than others and to reduce complexity in the resulting network graphs, notoriously considered the downside of NA. At the same time, however, this intervention may cause the least occurring entities to be under-represented in the visualisations. Thus, the transparent and detailed documentation of how we intervened on the data that originates the NA visualisations counterbalances the illusion of neutrality and completeness often conveyed by ultra-polished NA visualisations.
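A sketch of such a proportionate selection, assuming a toy table of entity counts per title (all names and figures are invented):

```python
import pandas as pd

# Toy entity-mention counts per newspaper title.
mentions = pd.DataFrame({
    "title": ["L'Italia"] * 4 + ["Cronaca Sovversiva"] * 3,
    "entity": ["sicilia", "new york", "roma", "palermo", "sicilia", "roma", "milano"],
    "count": [120, 90, 75, 10, 40, 35, 5],
})

# Keep the top-N entities per title rather than overall, so that smaller
# titles are not drowned out by the largest ones; N itself remains a
# subjective, documented intervention on the data.
top_per_title = (mentions.sort_values("count", ascending=False)
                         .groupby("title", sort=False)
                         .head(2))
print(top_per_title)
```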

Another issue of NA data modelling concerns the theoretical assumption upon which the technique is based. As a bare minimum, a network visualisation connects nodes through a line (i.e., an edge) that carries information on the type of relation between the nodes (i.e., attributes). Nodes are understood as discrete objects, i.e., completely independent from each other (*cfr.* Chap. 4); this ultimately means that the nodes are modelled to remain stable and that the emphasis is on the relations, as these are believed to provide adequate explanations of social phenomena. However, this type of modelling arguably paints a rather artificial picture of both the phenomena and the actors, who remain unaffected by the changing relationships between them. To put it in Drucker's words:

This is a highly mechanistic characterization of nodes (and edges), whether they consist of human beings, institutions, or events which reduce[s] all relationships to the same presentation and make[s] static representations out of dynamic conditions. (2020, 180)

NA factually transforms continuous (i.e., inseparable) elements such as cultural actors into discrete and fixed points; this transformation is further modelled visually, giving the impression of a neutral, exact and observable description of their entanglement. The possibility to historicise actors and relations in DeXTER is a concrete example of how the post-authentic framework to NA aims to counteract this inevitably artificial 'flattening effect'. When developing the DeXTER interface, we decided to model the data points displayed in the graphs according to several parameters and attributes that reflect a conceptualisation of networks as lively and dynamic structures. By sliding the time bar (*cfr.* Fig. 5.8), the analyst can, for example, observe not just how the relationships between entities change over time but also the entities themselves. It is for instance possible to explore how entities of interest were mentioned by migrants over time: by selecting/deselecting specific titles (*cfr.* Fig. 5.9) of different political orientation and geographical location, and by selecting the frequency rate and sentiment polarity (*cfr.* Fig. 5.10) to observe the prevailing emotional attitude of the sentences in which the entities were mentioned together as well as their frequency of occurrence.

**Fig. 5.8** DeXTER default landing interface for NA. The red oval highlights the time bar (historicise feature)

**Fig. 5.9** DeXTER default landing interface for NA. The red oval highlights the different title parameters

By visualising both entities and relations and by creating dynamic and interactive NA visualisations, the DeXTER interface on the whole aims to provide several viewpoints on the same data, and it effectively shows how several dimensions of observation dramatically affect the graphical arrangements. In the case of the historicisation feature, for example, as the data is modelled in reference to the documents' timestamp, the analyst can swipe the time bar on the top left of the interface to explore the changing relationships between entities over time and/or at specific intervals. This adds a historical dimension to the networks and allows the analyst to observe and engage with changes in the graphs interactively as they reflect how the displayed entities were mentioned by migrants according to changing temporal parameters. We also added informative tool-tips next to each available option to encourage close engagement with the interface, with the process of data creation, with the method of NA itself and with the meanings offered by these parameters (Gaver et al. 2003).

**Fig. 5.10** DeXTER default landing interface for NA. The red ovals highlight the frequency and sentiment polarity parameters

The post-authentic framework conceptualises ambiguity and uncertainty as intrinsic elements of knowledge creation in the digital; thus, rather than rejecting or obscuring them, it preserves them as opportunities to reduce the reliance on potentially biased methods and, on the whole, to remind us of the illusion of certainty (Edmond 2019). Applied to NA, this means creating a space for interpretation, for instance by exposing the data's multi-dimensional complexity (Windhager et al. 2019b; Drucker 2020). In the DeXTER interface, this was implemented by providing multiperspectivity on the same nodes. DeXTER allows users to explore three types of networks: two entity-focused graphs (i.e., egocentric networks) and one issue-focused network. We decided to visualise the networks as egocentric networks for two reasons. Egocentric networks are local networks with one central node, known as the *ego*. This type of network visualises all the nodes directly connected to the ego, i.e., the *alters*. Crossley et al. (2015) suggest that one main advantage of egocentric networks is that they allow for rich visualisations even when all the entities in a data-set cannot be mapped because of the network's large size, which is indeed the case of *ChroniclItaly 3.0* as discussed in Chap. 3. Furthermore, the extensive information provided about the ego may offer a personal perspective on the node and the alters; indeed, thanks to this property, egocentric networks are often referred to as cognitive networks (Perry et al. 2018). We therefore chose egocentric network visualisations for their potential ability to provide relevant material for the study of migration as experienced and narrated by the migrants themselves. Starting from a selected entity of their choice, users can explore several parameters: the net of entities most frequently mentioned in the same sentence as the ego, the prevailing emotional attitude in those sentences, the number of times entities were mentioned together and the titles in which they were mentioned. This information is encoded and made available to the analyst both through pop-up tool-tips and through the colour of the edges (i.e., pastel blue for negative sentiment, white for neutral and pastel red for positive). Figure 5.11 shows the egocentric network for the GPE entity *sicilia* (Sicily) across all the titles of the collection as mentioned in sentences with prevailing positive sentiment. If ego-network is not selected, the graph additionally displays the relations among the alters. As shown in Fig. 5.12, the representation of relations can react significantly to the tiniest modification of parameters (Windhager et al. 2019b); even when the same node is selected, the overall offered perspective on the relational structure of the graph can change significantly.
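In graph terms, the distinction between the two entity-focused views can be sketched as follows with networkx; the graph data is invented, and the strict ego-only filtering is an illustrative reading of DeXTER's 'ego-network' option, not its actual implementation.

```python
import networkx as nx

# Toy weighted co-occurrence graph.
G = nx.Graph()
G.add_weighted_edges_from([
    ("sicilia", "new york", 2),
    ("sicilia", "palermo", 1),
    ("palermo", "new york", 1),
    ("palermo", "roma", 1),
])

# Egocentric view: the ego plus its directly connected alters (radius=1).
# nx.ego_graph also keeps the ties among the alters themselves.
ego = nx.ego_graph(G, "sicilia", radius=1)
print(sorted(ego.nodes()))  # ['new york', 'palermo', 'sicilia']

# A stricter ego-only view: keep just the edges incident to the ego,
# dropping the alter-alter relations.
strict = nx.Graph((u, v, d) for u, v, d in ego.edges(data=True)
                  if "sicilia" in (u, v))
print(list(strict.edges(data=True)))
```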

The third type of network visualisation (i.e., the issue-focused network) allows the exploration of entities starting from a specific issue. Whereas in an egocentric network users observe a network which has an actor/entity of their choice as the focal node, this third visualisation displays the actors mentioned in specific newspapers on specific days. In this way, the issue-focused network offers an additional perspective on the same digital object, potentially contributing valuable insights for the analysis of how events and actors of interest were portrayed by migrants of different political affiliation who were based in different parts of the United States. Thus, instead of offering one obvious meaning, DeXTER offers multiple perspectives, and by capturing heterogeneous contexts, it creates a tension that the analyst is encouraged to resolve through independent assessment (Gaver et al. 2003). Figure 5.13 shows the default issue-focused network graph.

**Fig. 5.11** DeXTER: egocentric network for the node *sicilia* across all titles in the collection in sentences with prevailing positive sentiment

DeXTER's visualisation of sentiment as an attribute of NA is also guided by post-authentic principles. As already discussed in Sect. 3.4, SA is a computational technique that aims to identify the prevailing emotional attitude, i.e., the sentiment, in a given text (or portions of a text); the sentiment is then typically categorised according to three labels, i.e., positive, negative or neutral. A problematic aspect of the technique is that it presents these labels as unambiguous, universally accepted categories, providing a neutral and observable description of reality and obscuring the highly problematic and interpretative quality of the very process of establishing such categories (*cfr.* Sect. 3.4) (Puschmann and Powell 2018). The concept of a 'sentiment score' additionally reinforces the illusion of objectivity, and it further obfuscates the inherently vague, profoundly subjective dimension of emotions and their definitions, a process intrinsically open to multiple interpretations and subject to ambiguity. As a way to acknowledge the ambiguities of the assumptions behind the technique and of a 'sentiment score', DeXTER's graph colouring scheme is fluid and nuanced (as opposed to solid colours): the colour gradients go from a darker shade of blue for the lowest score (i.e., negative) to a darker shade of red for the highest score (i.e., positive). DeXTER's visual representation of sentiment results in a deliberately blurred graph: the borders of the edges are purposely smudged and pale, and pastel shades are preferred over bright, solid shades; the aim is to openly acknowledge SA as ambiguous, situated and therefore open to interpretation, rather than precise, neutral and certain. By exposing these inconsistencies, post-authentic visualisations on the whole question the main positivist discourse around technology. We achieved this goal by providing a transparent documentation of how we identified the sentiment categories, how we aggregated the results, how we conducted the classification, how we interpreted the scores and how we rendered them in the visualisation, in the openly available dedicated GitHub repository, which also includes the code, links to the original and processed material and the files documenting the manual interventions.

**Fig. 5.12** DeXTER: network for the ego *sicilia* and alters across titles in the collection in sentences with prevailing positive sentiment

**Fig. 5.13** DeXTER: default issue-focused network graph
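The blue-to-red gradient described above can be approximated with a standard diverging colour map, as in the sketch below; the score range, colour map name and alpha suggestion are illustrative choices, not DeXTER's actual palette.

```python
from matplotlib import colormaps
from matplotlib.colors import Normalize

# Map sentiment scores in [-1, 1] onto a blue-to-red diverging gradient.
norm = Normalize(vmin=-1.0, vmax=1.0)
cmap = colormaps["coolwarm"]

for score in (-0.9, -0.2, 0.0, 0.3, 0.8):
    r, g, b, _ = cmap(norm(score))
    # A reduced alpha (e.g., 0.5) when drawing edges would give the
    # deliberately pale, smudged rendering described above.
    print(f"score {score:+.1f} -> rgb({r:.2f}, {g:.2f}, {b:.2f})")
```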

Finally, guided by the post-authentic framework, DeXTER emphasises the continuous making and re-making of data; this process of forming, arranging and interpreting data is encoded within the interface itself. Through the tab 'Data', users can at any point access and download the data behind the visualisations as it reflects their selection of filters and parameters (e.g., title, time interval, frequency, entity). The intention is to disrupt traditional notions that conceptualise data as fixed, unarguable and defined. At the same time, DeXTER acknowledges the collective responsibility of building a source of knowledge for current and future generations, and it frames the process of knowledge creation in the digital as accountable, unfinished and receptive to alternatives.

Through the exploration of several case studies, i.e., the creation, enrichment, analysis and visualisation of a digital object, this book argues that new theoretical paradigms are now urgently required; these must be centred on a reconceptualisation of digital objects as epistemic objects which themselves carry meanings and which therefore alter the perception of knowledge created in a digital environment. With specific reference to visualisations, interfaces and graphic display, the post-authentic framework that I propose in this book acknowledges them as problematic endeavours embedding a wide net of situated processes which require more systematic and sophisticated criteria than over-simplistic user-as-consumer- and problem-solver-models (Windhager et al. 2019a). The recognition of such complexities accepts and in fact embraces digital knowledge creation practices as being embedded in extremely convoluted networks of countless factors at play which cannot be fully trusted nor predicted. The post-authentic framework therefore recognises the limitations and biases of specific tools and techniques and exposes problematic processes such as data creation, selection and manipulation by openly disclosing their complexities and lifecycle, by thoroughly documenting the decisions and actions and by allowing users to access the data behind the visualisations, including making the acts of transformation explicit.

In the post-authentic interface DeXTER, we actualised this by providing a space for interpretation and individual assessment, by favouring multiperspectivity through different types of network visualisations and by offering dynamic and interactive graphs. This also arguably alleviates the issue of displaying artificial pictures of social phenomena due to the technique's intrinsic properties, for which actors remain stable and unaffected by the relations. While I am not implying that a post-authentic framework is the perfect approach to digital knowledge creation practices, I do argue that, by redefining our understanding of the theoretical dimensions of digital objects, tools, techniques, platforms, interfaces and infrastructures, especially for humanistic enquiry, the framework offers theoretical and methodological criteria that recognise the larger cultural relevance of digital objects, and it provides an urgently needed architecture for issues such as transparency, replicability, Open Access, sustainability, data manipulation, accountability and visual display.


# Conclusion

Philosophers until now have only interpreted the world in various ways. The point, however, is to change it. (Karl Marx, 1846)

As technology changes, society changes, and so the way society produces knowledge and culture also changes. Yet the predominant model of knowledge production continues to be one bound to the epistemology of last century's industrial societies. In this book, I argued that to respond to the radical changes brought by the digital transformation of society and aggravated by the 2020 pandemic, the current model of knowledge creation must urgently be re-theorised. This means, I contended, pushing beyond mere observations of how higher education has been transitioning towards the digital and recognising that a more fundamental question needs to be asked. For example, it is no longer sufficient to reflect on how the digital transformation has required teachers to rapidly acquire digital skills to adapt and rethink their learning methods, or how the digital has affected branches of knowledge (e.g., the humanities) or individual disciplines (e.g., history), or how differently academics now think about sharing their research findings (e.g., with end-users) or how their research is increasingly dominated by *data* rather than by *sources*, including having to consider issues of storage, archival, transparency, etc. A different critical awareness is now required: the shift has been *in*—as opposed to *towards*—the digital.

Claiming that the shift has been in the digital acknowledges conclusively that the digital is now integral not only to society and its functioning, but crucially also to how society produces knowledge and culture. My argument for a new model of knowledge production therefore starts from recognising that persisting binary modulations in relation to the digital—for example, between digital knowledge creation and non-digital knowledge creation—are no longer relevant, in that they continue to suggest artificial, irrelevant divisions. Such divisions, I contended, not only slow down progress and hinder knowledge advancement but, by fragmenting expertise, sustain a model of knowledge that does not adequately respond to a reality complexified by the digital. It has been the argument of this book that the digital transformation of society requires a more problematised understanding of the digital as an organic entity that brings multiple levels of complexity to reality, many of which have unpredictable consequences. Our traditional model of knowledge creation, based on single-discipline perspectives, hierarchical divisions and competition, is no longer suited to meet the unprecedented challenges facing societies in the digital.

In this book, I developed a new theoretical and methodological framework, the post-authentic framework, which critiques dominant positivistic and deterministic views of technology and computational methods and offers new terminologies, concepts and approaches in reference to the digital, digital objects and practices of knowledge production in the digital. The post-authentic framework breaks with dialectical principles of dualism and antagonism and with the rigid model of knowledge creation that divides knowledge into disciplines and disciplines into two areas: the sciences and the humanities. Dual notions of this kind, I argued, are complicit in an assiduously cultivated discourse that has historically exalted digital methods as exact, rigorous, neutral, and as more relevant and funding-worthy than critical approaches. This includes the cosy and reassuring myth that data is unarguable, bias-free, precise and reliable, as opposed to sources and human consciousness, which have increasingly been sidelined as carriers of bias, unreliability and inequality.

My reframing of the digital through the post-authentic framework helps us recognise that the narrative simplification around computational techniques and the sidelining of consciousness can no longer be afforded, because knowledge does not respect the limits of disciplines and the implications of being in the digital transcend such artificial boundaries. This is a reality we can no longer ignore, and one which can only be confronted through a reconfigured model of knowledge creation that reconceptualises it as happening in the digital. The world has entered a new dimension in which higher education can no longer afford to opportunistically see technology and its production as instrumental and contextual to knowledge and teaching, or simply as an object of critique, admiration, fear or envy. The post-authentic framework that I proposed in this book functions as a radical critique of such outdated conceptualisations of the digital and argues that the current model of knowledge creation, with its established boundaries between disciplines and specialisations, is not suited to respond to the complex challenges of a world in the digital.

Instead, the framework advocates a notion of knowledge as fluid, in which differences are not rejected but welcomed according to the principles of symbiosis and mutualism (*cfr.* Sect. 2.2). Symbiosis and mutualism oppose models of reality that support individualism and separateness as inevitably leading to conflict and competition; one such model of reality is the division of knowledge into monolithic disciplines. Borrowed from biology, the concept of symbiosis breaks with the current conceptualisation of knowledge as separate, linear and fragmented into multiple disciplines, and that of the digital as a static, inconsequential entity. To the contrary, symbiosis evokes ideas of close and long-term cooperation between different organisms and the continual renegotiation of interactions; past, present and future systems; power relations; infrastructures; interventions; curations and curators; programmers and developers.

Mutualism opposes interspecific competition, that is, when organisms from different species compete for a resource, with the result that only one of the actors involved benefits. I maintained that our model of knowledge creation based on hierarchical separations between disciplines resembles an interspecific competition dynamic, as it has forced knowledge production to operate within a space of conflict and competition. This model, I contended, is outdated and inadequate: it traps curiosity in rigid categories, and it is unsuited to rethinking and explaining the transformative effect the digital is having on our culture and society; to use Virginia Eubanks' words, it contributes to automating inequality and can therefore make society worse. I therefore argued that any re-modulation still operating within the current disciplinary model of knowledge creation is no longer sufficient; to this end, I proposed the notions of symbiosis and mutualism to help us reconceptualise knowledge as fluid and inseparable. Symbiosis and mutualism shape a model in which curiosity is finally given long-overdue free rein, and in which the different areas of knowledge do not compete against each other but benefit from a mutually compensating relationship. When asking ourselves the questions 'How do we produce knowledge today?' and 'How do we want our next generation of students to be trained?', the concepts of symbiosis and mutualism may guide our answers.

Symbiosis and mutualism are also central notions for the development of a more problematised conceptualisation of digital objects and digital knowledge production. The post-authentic framework re-examines the digital as situated and partial: an extremely convoluted assemblage of factors and actors, themselves part of wider networks of situated components, processes and mechanisms of interaction, and of the various forms of power embedded in computational processes and beyond. As such, far from being mere immaterial copies of originals, digital objects are acknowledged as bearing consequences which transcend traditional questions of authenticity; digital objects are never finished, nor can they be finished; countless versions can endlessly be created through processes that are shaped by past decisions and in turn shape subsequent ones. Thus, the post-authentic framework engages with both products and processes, which are understood as never neutral, as incorporating external, situated systems of interpretation and management and therefore as bearing consequences which go beyond the object-centred culture of authenticity.

To exemplify this complexity of conflating humans, entities and processes and past, present and future experiences, I used *ChroniclItaly 3.0*, a digital cultural heritage collection of Italian American newspapers published between 1898 and 1936. Specifically, I examined and illustrated how the application of the post-authentic framework can inform the creation, enrichment, analysis and visualisation of a digital object. By redefining our understanding of both the conceptual and concrete dimensions of digital objects, tools and techniques, the post-authentic framework provides theoretical and methodological criteria that recognise the larger cultural relevance of digital objects and of the methods to create, analyse and visualise them; it affords an architecture for issues such as transparency, replicability, Open Access, sustainability, data manipulation, accountability and visual display.

Central to the framework is the recognition that illusory, positivistic notions of the digital are ill-suited for the problems of the digital societies we live in. The post-authentic framework exposes aspects of knowledge creation in the digital that oppose both the mainstream fetishisation of big data and algorithms and an unproblematised understanding of the digital; it addresses issues such as ambiguity and uncertainty, and the subjective and interpretative dimension of collecting, selecting, categorising and aggregating, i.e., the act of creating data. In pursuing my case for a novel model of knowledge creation in the digital, I presented a range of personal case studies and examined how the application of the framework in my own work helped me address aspects of knowledge creation in the digital such as transparency, documentation and reproducibility; questions about reliability, authenticity and biases; and engaging with sources through technology. Using *ChroniclItaly 3.0* as a digital object, I applied the post-authentic framework to a variety of contexts such as digital heritage practices, digital linguistic injustice, critical digital literacy and critical digital visualisation, and I devoted specific attention to four key aspects of knowledge creation in the digital: the creation of a digital object in Chap. 2, its enrichment in Chap. 3, its analysis in Chap. 4 and its visualisation in Chap. 5. This auto-ethnographic and self-reflexive approach allowed me to show that a re-examination of digital knowledge creation can no longer be achieved from a distance, but only from the inside. Ultimately, the book demonstrated that it is only through conscious awareness of the delusional belief in the neutrality of data, tools, methods, algorithms, infrastructures and processes that the biases embedded in these systems, and amplified by their ubiquitous use, can in fact be identified and addressed.

In Chap. 3, for example, I showed how, from pre-processing to data augmentation, the application of the post-authentic framework to the task of enriching digital material can guide each action of an enrichment workflow. Using the case examples of DeXTER and *ChroniclItaly 3.0* (Viola and Fiscarelli 2021a) and informed by symbiosis and mutualism, Chap. 3 illustrated how the post-authentic framework can guide the interaction with the digital, not as a strategic (grant-oriented) or instrumental (task-oriented) collaboration but as a cognitive mutual *contribution*. In particular, I unpacked the ambiguities and uncertainties of methods such as optical character recognition (OCR), named entity recognition (NER), geolocation and sentiment analysis (SA) and showed how the post-authentic framework can help address these challenges, for instance, through a thorough understanding of the assumptions behind these techniques, constant updating and critical supervision. The framework recognises curatorial practices as manipulative interventions which, especially in the case of cultural heritage material, bear the consequence of being a source of knowledge for current and future generations.
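
As an illustration only, and not the actual DeXTER pipeline, the sketch below shows how one such enrichment step (NER followed by geocoding) might look in Python, assuming spaCy with its Italian model installed and the geopy wrapper around the Nominatim geocoder; the comments mark where critical supervision is required:

```python
import spacy
from geopy.geocoders import Nominatim

nlp = spacy.load("it_core_news_sm")                     # assumed Italian model
geolocator = Nominatim(user_agent="enrichment-sketch")  # hypothetical agent name

text = "Gli italiani di New York salutano i piroscafi in partenza da Genova."
doc = nlp(text)

for ent in doc.ents:
    if ent.label_ == "LOC":  # Italian spaCy models tag places as LOC
        # Geocoding is itself an interpretative act: OCR noise, historical
        # place names and homonyms all demand human verification of the output.
        place = geolocator.geocode(ent.text)
        if place is not None:
            print(ent.text, (place.latitude, place.longitude))
```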

This book was also a reflection on the implications of the digital transformation for our perception of the world. Drawing on the mathematical concepts of discrete vs continuous modelling of information (*cfr.* Chap. 4), I discussed some of the repercussions of the discretisation of society, that is, the transformation of continuous material into discrete form as binary sequences of 0s and 1s, which is especially consequential for the notions of causality and correlation in relation to knowledge creation. In discrete systems, causality is hidden because information is discretised into exact and separate points, which must be categorised and made explicit. As a result, we are given a digitally mediated image of the world, meaning that the relational causality of continuous information is replaced by predictions of correlations. Thus, societies in the digital in which the 'big data philosophy' reigns are, I argued, offered countless patterns but no explanations for them. We, the digital citizens, are left to deal with a patterned, yet a-causal, way of making sense of reality.
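
The point can be made concrete with a small numerical illustration of my own, not drawn from the book's case studies: two discrete series that merely share a trend correlate almost perfectly, yet neither explains the other.

```python
import numpy as np

rng = np.random.default_rng(42)
t = np.arange(200)

# Two unrelated measurements that happen to share the same upward trend
a = 0.5 * t + rng.normal(0, 8, t.size)
b = 0.5 * t + rng.normal(0, 8, t.size)

# A strong correlation (roughly 0.9) emerges, but it is a pattern,
# not an explanation: no causal link connects a and b.
print(round(float(np.corrcoef(a, b)[0, 1]), 2))
```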

Closely related to this point is the use of metaphorical language to name computational techniques, such as *topic* modelling, *sentiment* analysis and machine *learning* (ML); this phenomenon can be seen as a way to make sense of an a-causal reality. Indeed, conflating specific mathematical concepts such as discrete vs continuous modelling of information with such familiar notions has created reassuring expectations that machines can learn to understand language and somehow provide neutral, precise and understandable accounts of large quantities of textual material. In the case of SA, this altered image is that the subjectivity of human emotions can be reduced to two or three categories and quantified according to probabilistic calculations; in the case of ML, the unique, holistic human process of experiential learning and of connecting logic with contextual factors is discretised into probability scores over huge, yet partial, quantities of discrete data; in the case of topic modelling, the text itself disappears and so does its continuous structure, i.e., the wider context that produced it. The computational dissembling of the causal structure by the dualistic system of 0s and 1s hides the original continuous nature to which the data refers. The use of metaphorical language such as 'sentiment', 'learning' and 'topic', I argued, has therefore certainly contributed to making these methods extremely popular, especially outside their fields of origin, but at the same time, by obfuscating the precise mathematical laws upon which these techniques are based, it has created unrealistic beliefs.
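
To show what 'sentiment' amounts to computationally, here is a deliberately toy sketch, assuming scikit-learn rather than any tool used in the book: a classifier that reduces a sentence to a probability distribution over two categories.

```python
# What 'sentiment' amounts to computationally: a probability distribution
# over a handful of categories, not an understanding of emotion.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data; a real model would be trained on thousands of examples
texts = ["a wonderful joyful day", "terrible sad news",
         "great happy celebration", "awful painful loss"]
labels = ["positive", "negative", "positive", "negative"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

probs = model.predict_proba(["joyful news of a sad departure"])[0]
print(dict(zip(model.classes_, probs.round(2))))
```

An ambivalent sentence simply becomes a number near 0.5 for each class; the model has quantified, not understood, the emotion.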

The post-authentic framework can be a useful tool to guide the unpacking of the properties and assumptions of the computational techniques used to analyse a digital object. Using topic modelling as an example, in Chap. 4, I showed how the framework can be applied to engage critically with software. At the core of the framework is the importance of maintaining a close connection with the digital object; in the chapter, for example, I stressed how aspects such as pre-processing, corpus preparation and choosing the number of topics, typically regarded as unproblematic, are in fact fundamental moments within a topic modelling workflow in which the analyst is required to make countless choices. The example of topic modelling demonstrates how the post-authentic framework can guide the exploration, questioning and challenging of the interpretative potential of computation.
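
A compressed sketch, using Gensim's LDA implementation on an invented toy corpus, can make these choice points visible; every commented line is a decision the analyst, not the algorithm, makes:

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

# Toy corpus standing in for a real collection; lowercasing and tokenisation
# are themselves pre-processing choices, not neutral defaults.
docs = [
    "emigrati italiani new york lavoro".split(),
    "lavoro fabbrica new york salario".split(),
    "festa comunita chiesa famiglia".split(),
    "famiglia chiesa festa tradizione".split(),
]

dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

for k in (2, 3):  # the 'right' number of topics is itself a judgement call
    lda = LdaModel(bow, num_topics=k, id2word=dictionary,
                   random_state=42, passes=20)
    coherence = CoherenceModel(model=lda, corpus=bow, dictionary=dictionary,
                               coherence="u_mass").get_coherence()
    print(k, round(coherence, 3))  # coherence guides, but does not settle, the choice
```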

Operating within the post-authentic framework crucially means acknowledging digital objects as living entities that have far-reaching, unpredictable consequences; the continually changing complexity of the networks of processes and actors involved must therefore always be critically supervised. The visualisation of a digital object is one such process. The post-authentic framework opposes an uncritical adoption of digital methods and points to the intrinsically dynamic, situated, interpreted and partial nature of the digital. Despite often being employed as exact ways of presenting reality, visualisations are extremely ambiguous techniques which embed numerous human decisions and judgement calls. In Chap. 5, I illustrated how the post-authentic framework can be applied to visualisation by discussing two examples: efforts towards the development of a user interface (UI) for topic modelling and the design choices for developing the app DeXTER, the interactive visualisation interface for exploring *ChroniclItaly 3.0*. I specifically centred my discussion on how the ambiguities and uncertainties of topic modelling, network analysis (NA) and SA can be encoded visually. A key notion of the post-authentic framework is the acknowledgement of curatorial practices as manipulative interventions, and of how it is in fact through exposing ambiguities and uncertainties that knowledge creation in the digital can be kept honest and accountable for current and future generations.
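
By way of illustration, and assuming NetworkX and Matplotlib with invented confidence values rather than DeXTER's actual data, one simple way to encode uncertainty visually is to scale edge width to the confidence of each extracted relation, so that the judgement calls behind the graph remain visible:

```python
import matplotlib.pyplot as plt
import networkx as nx

G = nx.Graph()
G.add_edge("L'Italia", "New York", confidence=0.9)   # confidence values are invented
G.add_edge("L'Italia", "Genova", confidence=0.4)
G.add_edge("La Rassegna", "New York", confidence=0.6)

pos = nx.spring_layout(G, seed=42)
widths = [4 * G[u][v]["confidence"] for u, v in G.edges()]  # uncertainty as width

nx.draw_networkx_nodes(G, pos, node_color="lightgrey")
nx.draw_networkx_edges(G, pos, width=widths)
nx.draw_networkx_labels(G, pos, font_size=8)
plt.axis("off")
plt.show()
```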

Through the application of the post-authentic framework to these four case examples, the book aimed to show how an uncritical and naïve approach to the use of computational methods is bound to reproduce the very opaque processes that the publicised algorithmic discourse claims to break and, more worryingly, contributes to making society worse. The book was therefore also a contribution towards systemic change in knowledge creation practices and, by extension, in society at large; it provided a new set of notions and methods that can be implemented when collecting, assessing, reviewing, enriching, analysing and visualising digital material. It is this more problematised notion of the digital, conceptualised in the framework, that highlights how its transcending nature makes old dichotomies between digital knowledge creation and non-digital knowledge creation no longer relevant and, in fact, harmful.

The digitisation of society, already well on its way before the COVID-19 pandemic but certainly brought to an irreversible turning point by the 2020 health crisis, has brought into sharper focus how the digital exacerbates existing fractures and disparities in society. Unable to deal adequately with the complexity of society and social change, the current model of knowledge creation urgently requires re-theorisation. This book is therefore a wake-up call for understanding the digital as no longer contextual to knowledge creation and for recognising that a model of disciplinary compartmentalisation sustains an anachronistic and ill-equipped way to encapsulate and explain society. All information is now digital, and algorithms are ever more central nodes of knowledge and culture production, with an increased capacity to shape society at large. As digital vs non-digital positions have entirely lost relevance, it has become increasingly futile to create ultra-specialised disciplines from other disciplines' overlapping spaces, or indeed to invest energy in trying to define those, as in the case of DH; the digital transformation has magnified the inadequacy of a mono-perspective approach, the legacy of a model of knowledge that compartmentalises competing disciplines. Scholars, researchers, universities and institutions must acknowledge the central role they have to play in assessing how knowledge is created, not just today but also for future generations.

The new theoretical and methodological framework that I proposed in this book moves beyond the current static conceptualisation of knowledge production, which praises interdisciplinarity but forces knowledge into rigid categories. To the contrary, the framework offered novel concepts and terminologies that break with dialectical principles of dualism and antagonism, including dichotomous notions of digital vs non-digital, sciences vs the humanities, authentic vs non-authentic and computational/neutral vs non-computational/biased. The re-devised notions, practices and values that I offered help re-figure the way in which society conceptualises data, technology, digital objects and the process of knowledge creation in the digital.

My re-examination of the current model of knowledge includes not just scholarship but pedagogy too. And whilst pedagogy is not the main focus of this book, the arguments I put forward here for scholarship apply to it equally. In order to achieve systemic change, academic programmes must be updated to include opportunities for critical reflection on the pressing issues stemming from the ubiquitous underpinning of AI in our societies. Through real use cases similar to those illustrated throughout the chapters of this book, students would learn about the deep implications of digital technologies for contemporary culture and society. In the words of Timnit Gebru, the research scientist who was recently fired by Google after exposing how strongly biased Google's AI systems are (Bender et al. 2021), 'The people creating the technology are a big part of the system. If many are actively excluded from its creation, this technology will benefit a few while harming a great many'. Indeed, as technology is a central locus of knowledge and culture production, and AI technology in particular is dominated by an almost entirely male, predominantly white workforce, the culture that is produced replicates the biases of the workforce that is building it.

Although there may not be any initial intention of using biased models, tech companies become immediately accountable, at least from an ethical perspective if not yet a legal one, as soon as they refuse to acknowledge and correct such biases even when these are clearly exposed. If it is true that governments are spectacularly behind in creating rules for the ethical use of this technology, it is equally true that big tech companies should not wait for laws to be passed. Because of the serious social repercussions of the technology they create, they have a responsibility to bring this issue to the centre of their organisations. Meanwhile, universities also have a responsibility to train the next generation of thinkers, scholars and academics, as well as of digital citizens at large, in ethical digital management. Equally, research funding agencies must specifically require that the issue of digital ethics is explicitly addressed by researchers in their projects, for instance, by demanding a critical digital component in their proposals. As users and co-producers of technology, our responsibility is to counterbalance the main AI discourse with new, more honest narratives, to reflect critically on how we are producing knowledge today and for tomorrow, and on how we educate the next generation of students and digital citizens to be. The post-authentic framework of knowledge creation in the digital provides a means to communicate and incorporate values of honesty, accountability, transparency and sustainability into knowledge. It reminds us that a racist, sexist, homophobic digital society is not so much a reflection of human subjectivity in data and algorithms as proof of its pretend absence.




# INDEX

## **A**

- Accountability, 76, 140
- Algorithm, 13, 14, 17
- Algorithmic discourse, 14–16
- Ambiguities, 33, 34, 76, 123, 132, 141, 143
- Artificial Intelligence, 10, 13, 17, 30, 58, 109, 145
- Authenticity, 32, 38–40, 46, 140

## **B**

- Bag of words, 99
- Big data, 10, 12, 13, 16, 22, 90
  - analytics, 89–91
  - culture, 121
  - hype, 90
  - philosophy, 89, 91, 104, 142
- Brittle book crisis, 43

## **C**

- C2DH, 50, 112
- Cambridge Analytica, 12
- Causal inference challenge, 91
- Causality, 33, 79, 84–88, 91, 104, 142
- Causal machine learning, 92
- Causation, 85, 86
- Chain migration, 54
- Chronicling America, 43, 48, 49, 51, 61, 63
- ChroniclItaly, 47, 51, 53
- ChroniclItaly 2.0, 50, 51
- ChroniclItaly 3.0, 47, 59, 111, 113, 124, 140
- Coherence, 100
- Collocation, 88
- Completeness, 38
- Complexity Theory, 89, 90
- Conceptual metaphors, 81
- Content enrichment, 58
- Continuous, 84, 92
- Continuous modelling of information, 33, 141
- Corpus linguistics, 73
- Corpus preparation, 95, 143
- Correlations, 33, 79, 85, 87–91, 93, 104, 122, 142
- Coursera, 4
- COVID-19 pandemic, 4, 6, 10, 82
- Critical data literacy, 116


- Critical data management, 118
- Critical digital humanities, 8, 21–23, 25, 26, 29
- Critical digital literacy, 17, 111, 123
- Critical digital visualisation, 17
- Critical Heritage Studies, 39, 42
- Critical Posthumanities, 9, 31, 39
- Critical visualisation literacy, 116
- Cultural heritage, 39, 57, 110, 125

## **D**

- Database categories, 15
- Data visualisation, 34, 107, 108, 143
- Dead metaphors, 84
- Deep learning, 33, 51, 67
- Determinism, 86, 87
- DeXTER, 33, 34, 50, 58, 70, 76, 111, 123, 125, 128, 130, 132–134, 141, 143
- DHARPA, 112
- Diasporic communities, 54
- Diasporic newspapers, 48
- Digital capitalism, 21, 22
- Digital cultural heritage, 30, 46, 65, 77, 111
- Digital Divide, 78
- Digital heritage, 37, 38, 40, 41, 46, 52, 58, 123
  - material, 70
- Digital hermeneutics, 30
- Digital humanities, 3, 8, 21–23, 25, 26, 29, 40, 42, 94, 112, 123, 125
- Digital inequality, 78
- Digital knowledge creation process, 8, 16, 32, 40, 43, 52, 60, 65, 67, 68, 73, 78, 85, 89, 94, 104, 110, 111, 117, 118, 121, 123, 129, 134
- Digital knowledge production, 46, 105, 112
- Digital language injustice, 63, 98

- Digital linguistic injustice, 17
- Digital objects, 8, 16, 32, 37, 39–41, 43, 44, 46–48, 54, 62, 65, 67, 70, 94, 110, 111, 113, 114, 116, 118, 134, 140
- Digital transformation, 1, 2, 4, 6, 18, 22, 25, 26, 29, 31, 32, 78, 141
  - of society, 137
- Digitally-born heritage, 38
- Digital Turn, 2, 10, 11, 19, 50, 86
- Digitisation, 41–43, 46, 49, 57, 63, 98
  - of society, 44, 86
- Digitised heritage, 38
- Discourse-driven topic modelling, 113
- Discourse-historical approach, 113
- Discrete, 72, 82, 84, 92, 126
  - vs continuous, 86, 87
  - data, 93
  - infinity of language, 92
  - modelling of information, 33, 141
- Distributional semantics theory, 88, 92
- Documentation, 60, 67, 76, 78, 112, 118, 126, 133

## **E**

- Egocentric network, 130
- Encoding of criticism, 68
- Enrichment, 59, 60, 70
- Entity linking, 58
- Ethnic press, 52, 66, 114

## **F**

- Facebook, 12, 15
- F1 score, 64
- Framing power, 82
- Functional view of causation, 86

## **G**

- Gartner Hype Cycle, 3
- Gensim, 99
- Geocoding, 58, 61, 67, 68
- Geolocation, 33, 55, 141
- GeoNewsMiner, 50, 64
- Graphical User Interface, 50
- Graph theory, 124

## **H**

- Heritage, 47
- Heritagisation, 58
- Higher education, 4, 34
- Historical migration, 54
- Historical newspapers, 48
- Human-in-the-loop, 109
- The humanities, 76, 83, 85, 86, 110

## **I**

- Immigrant communities, 52
- Immigrant press, 52
- Information Retrieval, 71, 99, 111
- Information visualisation, 107, 108, 111
- Interactive Topic Modelling, 109
- Inter-annotator agreement, 73
- Interdisciplinarity discourse, 19, 20, 27, 144
- Interpreting the topics, 101
- Interspecific competition, 45, 139
- Issue-focused network, 130
- Italian American diaspora, 70
- Italian American migration, 52, 53
- Italian immigrant communities, 51
- Italian immigrant newspapers, 53
- Italian Transatlantic migration, 124
- IVisClustering, 109

## **K**

- Knowledge creation, 3, 8, 17, 26, 27, 44, 55, 58, 77, 144

## **L**

- L'Italia, 64, 67
- La Rassegna, 67
- Language injustice, 68
- Latent Dirichlet Allocation, 82, 95, 99–101, 114
- LDAvis, 109
- Lemmatisation, 59, 97
- Lemmatising, 118
- Library of Congress, 43, 48, 96, 113
- Linguistic categories, 72
- Lowercasing, 59, 61

## **M**

- Machine learning, 13–15, 58, 62, 84, 90, 91, 104, 108, 113, 142
- Mainstream humanities, 21
- MALLET, 99, 113
- Markers of identity, 66, 70, 124
- Mass migrations, 53
- Material culture, 39
- Materiality of the sources, 61, 63, 65, 66, 95, 113, 114, 118
- McKinsey Global Institute, 4, 5
- Meaning, 88, 91
- Mechanical Turk, 10, 17
- Metaphors, 81–83, 142
- Microfilm, 43, 46, 49, 61, 62
- Microtargeting, 12
- Migrant communities, 48
- Migrants' narratives, 54, 66
- Migration, 54
- Model of knowledge creation, 137

- Model perplexity, 103
- Monism, 32
- Mutualism, 31, 33, 45, 46, 55, 85, 113, 120, 139, 141

## **N**

- Named Entity Recognition, 33, 55, 58, 61, 141
- National Digital Newspaper Program, 48, 62
- National Endowment for the Humanities, 48
- Natural language processing, 71
- NER, 61
- Network analysis, 34, 111, 123, 124, 143
- N-grams, 100
- NLP, 51, 74, 88, 113
- Number of topics, 95, 143

## **O**

- Oceanic Exchanges, 47, 113
- OCR, 62, 63
  - errors, 100, 101
- Open Access, 31, 78, 140
- Optical Character Recognition, 55
- Optimal number of topics, 103
- Originality, 38
- Originary technicity, 17

## **P**

- Pandemic, 5, 8, 16, 18, 25, 26, 29, 44, 137
- Paradox of interdisciplinarity, 19, 31
- Parts-of-Speech Tagging, 58
- Patterns, 85, 87–91, 95, 101, 102, 113, 122
- Perplexity, 100
- Poker interfaces, 123

- Post-authentic framework, 16, 27, 28, 31–34, 39, 47, 55, 58, 62, 65, 70, 72, 76, 78, 79, 84, 85, 94, 95, 102, 110, 113, 118, 121, 123, 124, 129, 134, 138, 143
- Posthuman critical theory, 18, 31, 43, 44
- Predictive likelihood, 103
- Predictive policing, 13
- Pre-processing, 95, 118, 141, 143
- Punctuation marks, 61

## **R**

- Replicability, 31, 76, 77, 140
- Roadmap for Digital Cooperation, 78
- Rotogravure, 43

## **S**

- Scope, 73, 74, 96
- Sentiment analysis, 33, 34, 55, 58, 61, 70–73, 82, 84, 104, 111, 123, 124, 131, 141–143
- Sentiment magnitude, 74
- Sentiment score, 74
- Significance, 89
- Specificity of the sources, 114, 124
- Stemming, 59, 97, 118
- Stopwords, 59, 60
- Sustainability, 78, 140
- Symbiosis, 31, 33, 45, 46, 55, 85, 113, 120, 139, 141
- Symbiosis and mutualism, 34

## **T**

- Term frequency, 99
- Termite, 109
- Text annotation, 58
- TF-IDF, 99
- To digital knowledge creation, 73
- Tokenization, 59, 60
- Topic coherence, 103
- Topic Explorer, 109
- Topic modelling, 33, 34, 79, 82–85, 88, 91, 92, 94, 96, 97, 99–101, 105, 109, 112–114, 122, 142, 143
- Topic modelling interface, 115
- TopicNets, 109
- Topics' interpretability, 109
- Trading zones, 50
- Transatlantic transfer, 48
- Transparency, 31, 76, 78, 140
- Transversality, 31, 44, 45
- Transversal Posthumanities, 31
- Two cultures, 24, 25

## **U**

- Uncertainties, 33, 34, 76, 90, 123, 125, 141, 143
- UNESCO, 4, 37, 41, 42
- United Nations, 78
- User experience, 58
- UTOPIAN, 109

## **V**

- Vannevar Bush's Memex, 20
- Visual display, 31
- Visualisation, 34, 76
- Visualisation literacy, 94, 105, 111, 118

## **W**

- Whiteness, 53
- Wireframes, 118
- Word collocation, 99
- Word2vec, 88